R Data Analysis Projects
Gopi Subramanian
BIRMINGHAM - MUMBAI
R Data Analysis Projects
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
www.packtpub.com
About the Author
He has 16 patent applications with the US and Indian patent offices and several
publications to his credit. He is the author of Python Data Science Cookbook by Packt
Publishing.
I would like to thank my parents, Vijaya and Mani, and my sister Geetha for being a great
source of inspiration, and my family, Anita and Rishi, for their support.
About the Reviewer
Mark Hodnett is a data scientist with over 20 years of industry experience. He has worked
in a variety of industries, ranging from website development, retail loyalty, and industrial
systems to accountancy software. He holds a master's in data science and an MBA.
His current role is with AltViz as a senior data scientist. AltViz applies machine learning,
optimization, and intelligent evaluation to companies in many sectors, including retail and
insurance.
I would like to thank the author, Gopi, for giving me the opportunity to collaborate on this
project. I would also like to thank Sharon and Conor for their patience while I worked on
this project.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did
you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details. At www.PacktPub.com, you can also read a
collection of free technical articles, sign up for a range of free newsletters and receive
exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1788621875.
If you'd like to join our team of regular reviewers, you can email us at
customerreviews@packtpub.com. We award our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving our
products!
Table of Contents
Preface 1
Chapter 1: Association Rule Mining 6
Understanding the recommender systems 7
Transactions 8
Weighted transactions 9
Our web application 9
Retailer use case and data 10
Association rule mining 13
Support and confidence thresholds 25
The cross-selling campaign 30
Leverage 33
Conviction 34
Weighted association rule mining 35
Hyperlink-induced topic search (HITS) 42
Negative association rules 50
Rules visualization 53
Wrapping up 57
Summary 65
Chapter 2: Fuzzy Logic Induced Content-Based Recommendation 66
Introducing content-based recommendation 68
News aggregator use case and data 72
Designing the content-based recommendation engine 77
Building a similarity index 80
Bag-of-words 80
Term frequency 81
Document frequency 81
Inverse document frequency (IDF) 81
TFIDF 82
Why cosine similarity? 85
Searching 86
Polarity scores 89
Jaccard's distance 91
Jaccard's distance/index 92
Ranking search results 94
Fuzzy logic 95
Fuzzification 95
Preface
This book is for readers who want to leverage the R platform for their data analysis
projects/problems. We introduce readers to different R packages for various data analysis
purposes and help them use the right package for the right task. A complete project built
from scratch in each chapter will help readers better understand building end-to-end
predictive analytics solutions. This book covers a variety of topics, including building deep
learning networks in R, graph mining, streaming data analysis, sentiment classification, and
recommender systems.
Chapter 2, Fuzzy Logic Induced Content-Based Recommendation, addresses the cold start
problem in the recommender system. We handle the ranking problem with multi-similarity
metrics using a fuzzy sets approach.
Chapter 4, Taming Time Series Data Using Deep Neural Networks, introduces MXNet R, a
package for deep learning in R. We leverage MXNet to build a deep connected network to
predict stock closing prices.
Chapter 5, Twitter Text Sentiment Classification Using Kernel Density Estimates, shows how to
process Twitter data in R. We introduce delta-tfidf, a new metric for sentiment
classification. We leverage the kernel density estimate based Naïve Bayes algorithm to
classify sentiments.
Chapter 6, Record Linkage - Stochastic and Machine Learning Approaches, covers the problem of
master data management and how to solve it in R using the RecordLinkage package.
Chapter 8, Analyzing and Understanding Networks Using R, covers the igraph package for
performing graph analysis in R. We solve product network analysis problems with graph
algorithms.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can
include other contexts through the use of the include directive."
Words that you see on the screen, for example, in menus or dialog boxes, appear in the text
like this: "Clicking the Next button moves you to the next screen."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book, what you liked or disliked. Reader feedback is important to us as it helps us develop
titles that you will really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of your preferred archive extraction tool.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the internet, please provide us with
the location address or website name immediately so that we can pursue a remedy. Please
contact us at copyright@packtpub.com with a link to the suspected pirated material. We
appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.
Chapter 1: Association Rule Mining
This chapter is ordered in ascending complexity. We will start with our use case, designing
a cross-selling campaign for an imaginary retailer. We will define the goals for this campaign
and the success criteria for these goals. Having defined our problem, we will proceed to our
first recommendation algorithm, association rule mining. Association rule mining, also
called market basket analysis, is a method used to analyze transaction data to extract
product associations.
The subsequent sections are devoted to unveiling the plain vanilla version of association
rule mining and introducing some of the interest measures. We will then proceed to find
ways to establish the minimum support and confidence thresholds, the major two interest
measures, and also the parameters of the association rule mining algorithm. We will explore
more interest measures toward the end, such as lift and conviction, and look at how we can
use them to generate recommendations for our retailer's cross-selling campaign.
We will introduce a variation of the association rule mining algorithm, called the weighted
association rule mining algorithm, which can incorporate some of the retailer input in the
form of weighted transactions. The profitability of a transaction is treated as a weight. In
addition to the products in a transaction, the profitability of the transaction is also recorded.
Now we have a smarter algorithm that can produce the most profitable product
associations.
We will then introduce the Hyperlink-Induced Topic Search (HITS) algorithm. In places where a retailer's weight input is
not available, namely, when there is no explicit information about the importance of the
transactions, HITS provides us with a way to generate weights (importance) for the
transactions.
Next, we will introduce a variation of association rule mining called negative association
rule mining, which is an efficient algorithm used to find anti-patterns in the transaction
database. In cases where we need to exclude certain items from our analysis (owing to low
stock or other constraints), negative association rule mining is the best method. Finally, we
will wrap up this chapter by introducing arulesViz, an R package with some cool charts
and graphics for visualizing association rules, and a small web application, built with the
R Shiny package, to report our analysis.
In the last decade, recommender systems have achieved great success with both online
retailers and brick and mortar stores. They have allowed retailers to move away from group
campaigns, where a group of people receive a single offer. Recommender systems
technology has revolutionized marketing campaigns. Today, retailers offer a customized
recommendation to each of their customers. Such recommendations can dramatically
increase customer stickiness.
Retailers design and run sales campaigns to promote up-selling and cross-selling. Up-
selling is a technique by which retailers try to push high-value products to their customers.
Cross-selling is the practice of selling additional products to customers. Recommender
systems provide an empirical method to generate recommendations for retailers' up-selling
and cross-selling campaigns.
Retailers can now make quantitative decisions based on solid statistics and math to improve
their businesses. There are a growing number of conferences and journals dedicated to
recommender systems technology, which plays a vital role today at top successful
companies such as Amazon.com, YouTube, Netflix, LinkedIn, Facebook, TripAdvisor, and
IMDb.
Based on the type and volume of available data, recommender systems of varying
complexity and increased accuracy can be built. Here, historical data means a user and his
product interactions. Let's use this definition to illustrate the different types of data in the
context of recommender systems.
Transactions
Transactions are purchases made by a customer on a single visit to a retail store. Typically,
transaction data can include the products purchased, quantity purchased, the price,
discount if applied, and a timestamp. A single transaction can include multiple products. In
some cases, it may also register information about the user who made the transaction, for
example where the customer has allowed the retailer to store his information by joining a
rewards program.
A simplified view of the transaction data is a binary matrix. The rows of this matrix
correspond to a unique identifier for a transaction; let's call it transaction ID. The columns
correspond to the unique identifier for a product; let's call it product ID. A cell value is one
if the product is included in the transaction and zero otherwise.
Txn/Product   P1   P2   P3   ...   Pm
T1             0    1    1   ...    0
T2             1    1    1   ...    1
...           ...  ...  ...  ...   ...
Tn             0    1    1   ...    1
Weighted transactions
This is additional information added to the transaction to denote its importance, such as the
profitability of the transaction as a whole or the profitability of the individual products in
the transaction. In the case of the preceding binary matrix, a column called weight is added
to store the importance of the transaction.
In this chapter, we will show you how to use transaction data to support cross-selling
campaigns. We will see how the derived user product preferences, or recommendations
from the user's product interactions (transactions/weighted transactions), can fuel
successful cross-selling campaigns. We will implement and understand the algorithms that
can leverage this data in R. We will work on a simplified use case in which we need to
generate recommendations to support a cross-selling campaign for an imaginary retailer.
Our goal, by the end of this chapter, is to understand the concepts of association rule
mining and related topics, and solve the given cross-selling campaign problem using
association rule mining. We will understand how different aspects of the cross-selling
campaign problem can be solved using the family of association rule mining algorithms,
how to implement them in R, and finally build the web application to display our analysis
and results.
We will be following a code-first approach in this book. The style followed throughout this
book is to introduce a real-world problem, following which we will briefly introduce the
algorithm/technique that can be used to solve this problem. We will keep the algorithm
description brief. We will proceed to introduce the R package that implements the
algorithm and subsequently start writing the R code to prepare the data in a way that the
algorithm expects. As we invoke the actual algorithm in R and explore the results, we will
get into the nitty-gritty of the algorithm. Finally, we will provide further references for
curious readers.
Our retailer has provided us with his historical transaction data. The data includes his past
transactions, where each transaction is uniquely identified by an order_id integer, and the
list of products present in the transaction called product_id. Remember the binary matrix
representation of transactions we described in the introduction section? The dataset
provided by our retailer is in exactly the same format.
Let's start by reading the data provided by the retailer. The code for this chapter was
written in RStudio Version 0.99.491. It uses R version 3.3.1. As we work through our
example, we will introduce the arules R package that we will be using. In our description,
we will be using the terms order/transaction, user/customer, and item/product
interchangeably. We will not be describing the installation of the R packages used
throughout the code. It's assumed that the reader is well aware of how to install an R
package and has installed those packages before using them.
The given data is in a tabular format. Every row is a tuple of order_id, representing the
transaction; product_id, an item included in that transaction; and department_id, the
department to which the item belongs. This is our binary data representation, which is well
suited to the classical association rule mining algorithm. This
algorithm is also called market basket analysis, as we are analyzing the customer's
basket—the transactions. Given a large database of customer transactions where each
transaction consists of items purchased by a customer during a visit, the association rule
mining algorithm generates all significant association rules between the items in the
database.
Let's quickly explore our data. We can count the number of unique transactions and the
number of unique products:
library(dplyr)
data %>%
group_by('order_id') %>%
summarize(order.count = n_distinct(order_id))
data %>%
group_by('product_id') %>%
summarize(product.count = n_distinct(product_id))
# A tibble: 1 × 2
`"order_id"` order.count
<chr> <int>
1 order_id 6988
# A tibble: 1 × 2
`"product_id"` product.count
<chr> <int>
1 product_id 16793
We have 6988 transactions and 16793 individual products. There is no information about
the quantity of products purchased in a transaction. We have used the dplyr library to
perform these aggregate calculations, which is a library used to perform efficient data
wrangling on data frames.
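As an aside, because group_by is given a quoted column name above, it groups on a constant string; the counts are still correct, but the same numbers can be obtained more directly with a single summarize call. The sketch below assumes data is the retailer's data frame loaded earlier:

library(dplyr)

# Distinct transaction and product counts in one pass
data %>%
  summarize(order.count = n_distinct(order_id),
            product.count = n_distinct(product_id))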
In the next section, we will introduce the association rule mining algorithm. We will explain
how this method can be leveraged to generate the top N product recommendations
requested by the retailer for his cross-selling campaign.
We will explain the association rule mining algorithm and the effect of the interest measures
on the algorithm as we write our R code. We will halt our code writing in the required
places to get a deeper understanding of how the algorithm works, the algorithm
terminology such as itemsets, and how to leverage the interest measures to our benefit to
support the cross-selling campaign.
We will use the arules package, version 1.5-0, to help us perform association mining on
this dataset.
library(arules)
transactions.obj <- read.transactions(file = data.path, format = "single",
sep = ",",
cols = c("order_id", "product_id"),
rm.duplicates = FALSE,
quote = "", skip = 0,
encoding = "unknown")
We begin with reading our transactions stored in the text file and create an arules data
structure called transactions. Let's look at the parameters of read.transactions, the
function used to create the transactions object. For the first parameter, file, we pass our
text file where we have the transactions from the retailer. The second parameter, format,
can take either of two values, single or basket, depending on how the input data is
organized. In our case, we have a tabular format with two columns--one column
representing the unique identifier for our transaction and the other column for a unique
identifier representing the product present in our transaction. This format is named single
by arules. Refer to the arules documentation for a detailed description of all the
parameters.
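For contrast, if the retailer's file had instead stored one transaction per line as a comma-separated list of items, the basket format would be used; the file name below is purely hypothetical:

# Hypothetical basket-format file: each line is one transaction,
# for example "Banana,Limes,Organic Garlic"
basket.transactions <- read.transactions(file = "retail_baskets.csv",
                                         format = "basket",
                                         sep = ",",
                                         rm.duplicates = FALSE)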
We can see that there are 6988 transactions and 16793 products. They match the previous
count values from the dplyr output.
Let's explore this transactions object a little bit more. Can we see the most frequent items,
that is, the items present in the largest number of transactions, and conversely the least
frequent items, those present in very few transactions?
The itemFrequency function in the arules package comes to our rescue. This function
takes a transaction object as input and produces the frequency count (the number of
transactions containing this product) of the individual products:
data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = TRUE), 10)) # Most frequent
Banana 1456
Bag of Organic Bananas 1135
Organic Strawberries 908
Organic Baby Spinach 808
Organic Hass Avocado 729
Organic Avocado 599
Large Lemon 534
Limes 524
Organic Raspberries 475
Organic Garlic 432
head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = FALSE), 10) # Least frequent
0% Fat Black Cherry Greek Yogurt 1
0% Fat Blueberry Greek Yogurt 1
In the preceding code, we print the most and the least frequent items in our database using
the itemFrequency function. The itemFrequency function returns every item together
with its frequency, that is, the number of transactions in which it appears. We wrap the
sort function around itemFrequency to order this output; the sort order is controlled by
the decreasing parameter. When set to TRUE, it sorts the items in descending order of
their transaction frequency. Finally, we wrap the sort call in the head function to get the
top 10 most (or least) frequent items.
Banana is the most frequent product, appearing in 1,456 transactions. The
itemFrequency method can also return the percentage of transactions, rather than an
absolute count, if we set the type parameter to relative instead of absolute.
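As a quick check of that option, the same call with type set to relative returns the share of transactions instead of counts:

# Top 10 items by the proportion of transactions they appear in
head(sort(itemFrequency(transactions.obj, type = "relative"),
          decreasing = TRUE), 10)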
Another convenient way to inspect the frequency distribution of the items is to plot them
visually as a histogram. The arules package provides the itemFrequencyPlot function
to visualize the item frequency:
itemFrequencyPlot(transactions.obj,topN = 25)
In the preceding code, we plot the top 25 items by their relative frequency, as shown in the
following diagram:
As per the figure, Banana is the most frequent item, present in 20 percent of the
transactions. This chart can give us a clue about defining the support value for our
algorithm, a concept we will quickly delve into in the subsequent paragraphs.
Now that we have successfully created the transaction object, let's proceed to apply the
apriori algorithm to this transaction object.
The apriori algorithm works in two phases. Finding the frequent itemsets is the first
phase of the association rule mining algorithm. A group of product IDs is called an itemset.
The algorithm makes multiple passes into the database; in the first pass, it finds out the
transaction frequency of all the individual items. These are itemsets of order 1. Let's
introduce the first interest measure, support, here: the support of an itemset is the fraction
of the transactions in the database that contain all the items in that itemset.
Now, in the first pass, the algorithm calculates the transaction frequency for each product.
At this stage, we have order 1 itemsets. We will discard all those itemsets that fall below
our support threshold. The assumption here is that items with a high transaction frequency
are more interesting than the ones with a very low frequency. Items with very low support
are not going to make for interesting rules further down the pipeline. Using the most
frequent items, we can construct the itemsets as having two products and find their
transaction frequency, that is, the number of transactions in which both the items are
present. Once again, we discard all the two product itemsets (itemsets of order 2) that are
below the given support threshold. We continue this way until we have exhausted them.
Pass 1 :
Support = 0.1
Product, transaction frequency
{item5}, 0.4
{item6}, 0.3
{item9}, 0.2
{item11}, 0.1
item11 will be discarded in this phase, as its transaction frequency is below the support
threshold.
Pass 2:
{item5, item6}
{item5, item9}
{item6, item9}
As you can see, we have constructed itemsets of order 2 using the filtered items from pass
1. We proceed to find their transaction frequency, discard the itemsets falling below our
minimum support threshold, and step on to pass 3, where once again we create itemsets of
order 3, calculate the transaction frequency, and perform filtering and move on to pass 4. In
one of the subsequent passes, we reach a stage where we cannot create higher order
itemsets. That is when we stop:
# Interest Measures
support <- 0.01
# Frequent item sets
parameters = list(
support = support,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "frequent itemsets")
freq.items <- apriori(transactions.obj, parameter = parameters)
The apriori method is used in arules to get the frequent itemsets. This method takes
two parameters: the transactions object, transactions.obj, and a named list of
parameters. We create a named list called parameters. Inside the named list, we have an
entry for our support threshold. We have set our support threshold to 0.01, namely, one
percent of the transactions. We settled on this value by looking at the histogram we plotted
earlier. By setting the value of the target parameter to frequent itemsets, we specify
that we expect the method to return the final frequent itemsets. The minlen and maxlen
entries set lower and upper cutoffs on how many items we expect in our itemsets. By
setting minlen to 2, we say we don't want itemsets of order 1. While explaining phase 1 of
apriori, we said that the algorithm can make many passes over the database, and each
subsequent pass creates itemsets of order one greater than in the previous pass. We also
said that apriori ends when no higher order itemsets can be found. We don't want our
method to run to the very end, so with maxlen we say that if we reach itemsets of order 10,
we stop. The apriori function returns an object of type itemsets.
It's good practice to examine the created object, freq.items in this case. A closer look at
this object sheds light on how we will use its slots to create our data frame of itemsets:
str(freq.items)
Formal class 'itemsets' [package "arules"] with 4 slots
..@ items :Formal class 'itemMatrix' [package "arules"] with 3 slots
.. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. .. .. ..@ i : int [1:141] 1018 4107 4767 11508 4767 6543 4767 11187
4767 10322 ...
.. .. .. .. ..@ p : int [1:71] 0 2 4 6 8 10 12 14 16 18 ...
.. .. .. .. ..@ Dim : int [1:2] 14286 70
.. .. .. .. ..@ Dimnames:List of 2
.. .. .. .. .. ..$ : NULL
.. .. .. .. .. ..$ : NULL
.. .. .. .. ..@ factors : list()
.. .. ..@ itemInfo :'data.frame': 14286 obs. of 1 variable:
.. .. .. ..$ labels: chr [1:14286] "10" "1000" "10006" "10008" ...
.. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
..@ tidLists: NULL
..@ quality :'data.frame': 70 obs. of 1 variable:
.. ..$ support: num [1:70] 0.0108 0.0124 0.0124 0.0154 0.0122 ...
..@ info :List of 4
.. ..$ data : symbol transactions.obj
.. ..$ ntransactions: int 4997
.. ..$ support : num 0.01
.. ..$ confidence : num 1
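The data frame inspected below is built from the labels and the quality slot of freq.items, mirroring the complete script listed later in this chapter:

# Build a readable data frame of the frequent itemsets and their support
freq.items.df <- data.frame(item_set = labels(freq.items),
                            support = freq.items@quality)
head(freq.items.df, 5)
tail(freq.items.df, 5)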
item_set support
1 {Banana,Red Vine Tomato} 0.01030338
2 {Bag of Organic Bananas,Organic D'Anjou Pears} 0.01001717
3 {Bag of Organic Bananas,Organic Kiwi} 0.01016027
4 {Banana,Organic Gala Apples} 0.01073268
5 {Banana,Yellow Onions} 0.01302232
tail(freq.items.df)
item_set support
79 {Organic Baby Spinach,Organic Strawberries} 0.02575844
80 {Bag of Organic Bananas,Organic Baby Spinach} 0.02690326
81 {Banana,Organic Baby Spinach} 0.03048082
82 {Bag of Organic Bananas,Organic Strawberries} 0.03577562
83 {Banana,Organic Strawberries} 0.03305667
We create our data frame using these two lists of values. Finally, we use the head and tail
functions to quickly look at the itemsets present at the top and bottom of our data frame.
Before we move on to phase two of the association rule mining algorithm, let's stop for a
moment to investigate a handy feature of the apriori method. As you may have noticed,
our itemsets consist mostly of the high-frequency items. However, we can ask the
apriori method to ignore some of the items.
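The exclusion itself can be done through apriori's appearance argument; the call below is a sketch of one way to do it and is our assumption rather than the book's exact listing:

# Ask apriori to ignore two very frequent items while mining frequent itemsets
exclusion.items <- c("Banana", "Bag of Organic Bananas")
freq.items <- apriori(transactions.obj,
                      parameter = parameters,
                      appearance = list(none = exclusion.items,
                                        default = "both"))
freq.items.df <- data.frame(item_set = labels(freq.items),
                            support = freq.items@quality)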
As you can see, we have excluded Banana and Bag of Organic Bananas from our
itemsets.
Congratulations, you have successfully finished implementing the first phase of the
apriori algorithm in R!
Let's move on to phase two, where we will induce rules from these itemsets. It's time to
introduce our second interest measure, confidence. Let's take an itemset from the list given
to us by phase one of the algorithm, {Banana, Red Vine Tomato}. Two rules can be
induced from it: Banana => Red Vine Tomato and Red Vine Tomato => Banana.
How often are these two rules found to be true in our database? The confidence score, our
next interest measure, will help us measure this:
Confidence: For the rule Banana => Red Vine Tomato, the confidence is
measured as the ratio of the support for the itemset {Banana, Red Vine Tomato}
to the support for the itemset {Banana}. Let's decipher this formula. Let's say
that the item Banana has occurred in 20 percent of the transactions, and it has
occurred in 10 percent of the transactions along with Red Vine Tomato; the
confidence for the rule is then 50 percent, 0.1/0.2.
Similar to support, the confidence threshold is also provided by the user. Only those rules
whose confidence is greater than or equal to the user-provided confidence will be included
by the algorithm.
Let's say we have a rule, A => B, induced from the itemset {A, B}. The support
for the itemset {A, B} is the number of transactions in which both A and B occur,
divided by the total number of transactions. Alternatively, it's the probability
that a transaction contains {A, B}.
Now, the confidence for A => B is P(B | A), the conditional probability that a
transaction containing A also contains B: P(B | A) = P(A and B) / P(A) =
Support(A U B) / Support(A), where A U B denotes the itemset containing both
A and B.
Remember from probability theory that when two events are independent of
each other, the joint probability is calculated by multiplying the individual
probabilities. We will use this shortly for our next interest measure, lift.
With this knowledge, let's go ahead and implement phase two of our apriori algorithm in
R:
confidence <- 0.4 # Interest Measure
parameters = list(
support = support,
confidence = confidence,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "rules"
)
rules <- apriori(transactions.obj, parameter = parameters)
rules.df <- data.frame(rules = labels(rules)
,rules@quality)
Once again, we use the apriori method; however, this time we set the target parameter
in our parameters named list to rules. Additionally, we provide the confidence
threshold. After calling apriori, we use the returned object, rules, to build our data
frame, rules.df, so that we can explore and view the rules conveniently. Let's look at the
output data frame, rules.df. For the given confidence threshold, we can see the set of
rules produced by the algorithm:
head(rules.df)
rules support confidence lift
1 {Red Vine Tomato} => {Banana} 0.01030338 0.4067797 1.952319
2 {Honeycrisp Apple} => {Banana} 0.01617058 0.4248120 2.038864
3 {Organic Fuji Apple} => {Banana} 0.01817401 0.4110032 1.972590
Lift: Often, we may have rules with high support and confidence that are,
nevertheless, of no use. This occurs when the item on the right-hand side of the
rule has a higher probability of being selected alone than together with the
associated item. Let's go back to a grocery example. Say there are 1,000
transactions; 900 of them contain milk, 700 contain strawberries, and 300 contain
both strawberries and milk. A typical rule that can come out of such a scenario is
strawberry => milk, with a confidence of about 43 percent:
support(strawberry, milk) / support(strawberry) = 0.3 / 0.7. This rule is
misleading and not going to help the retailer at all, as the probability of people
buying milk anyway is 90 percent, much higher than the 43 percent given by this rule.
This is where lift comes to our rescue. For two products, A and B, lift measures how many
times more often A and B occur together than expected if they were statistically
independent. Let's calculate the lift value for this hypothetical example:
lift(strawberry => milk) = support(strawberry, milk) / (support(strawberry) * support(milk))
                         = 0.3 / (0.9 * 0.7)
                         ≈ 0.48
There are tons of other interest measures available in arules. They can be obtained by
calling the interestMeasure function. We show how we can quickly populate these
measures into our rules data frame:
interestMeasure(rules, transactions = transactions.obj)
rules.df <- cbind(rules.df, data.frame(interestMeasure(rules,
transactions = transactions.obj)))
We will not go through all of them here. There is a great deal of literature
available discussing these measures and their use in filtering the most useful
rules.
Alright, we have successfully implemented our association rule mining algorithm; we went
under the hood to understand how the algorithm works in two phases to generate rules. We
have examined three interest measures: support, confidence, and lift. Finally, we
know that lift can be leveraged to make cross-selling recommendations to our retail
customers. Before we proceed further, let's look at some more functions in the arules
package that will allow us to do some sanity checks on the rules. Let us start with
redundant rules:
is.redundant(rules) # Any redundant rules ?
Consider two rules, Rule 1 and Rule 2, that share the same right-hand side, Banana, where
the left-hand side of Rule 2 is a subset of the items in Rule 1's left-hand side and the
confidence of Rule 2 is greater than or equal to that of Rule 1. In this case, Rule 2 is the
more general rule, and Rule 1 is considered redundant and will be removed. Another
sanity check is to find duplicated rules.
To remove rules with the same left- and right-hand side, the duplicated function comes in
handy. Let's say we create two rules objects, using two sets of support and confidence
thresholds. If we combine these two rules objects, we may have a lot of duplicate rules. The
duplicated function can be used to filter these duplicate rules. Let us now look at the
significance of a rule:
is.significant(rules, transactions.obj, method = "fisher")
Given the rule A => B, we explained that lift calculates how many times more often A and
B occur together than expected under independence. There are other ways of testing this
independence, such as a chi-squared test or Fisher's exact test. The arules package provides the is.significant
method to do a Fisher or a chi-square test of independence. The parameter method can
either take the value of fisher or chisq depending on the test we wish to perform. Refer
to the arules documentation for more details. Finally let us see the list of transactions
where these rules are supported:
as(supportingTransactions(rules, transactions.obj), "list")
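Putting these sanity checks together, one way (a sketch on our part, not the book's listing) to prune a rules object is:

# Keep only non-redundant, non-duplicated, statistically significant rules
rules.pruned <- rules[!is.redundant(rules)]
rules.pruned <- rules.pruned[!duplicated(rules.pruned)]
rules.pruned <- rules.pruned[is.significant(rules.pruned, transactions.obj,
                                            method = "fisher")]
length(rules.pruned)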
Here is the complete R script we have used up until now to demonstrate how to leverage
arules to extract frequent itemsets, induce rules from those itemsets, and filter the rules
based on interest measures.
transactions.obj
# Item frequency
head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = TRUE), 10) # Most frequent
head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = FALSE), 10) # Least frequent
itemFrequencyPlot(transactions.obj,topN = 25)
# Interest Measures
support <- 0.01
# Frequent item sets
parameters = list(
support = support,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "frequent itemsets"
)
freq.items <- apriori(transactions.obj, parameter = parameters)
# Let us examine our frequent itemsets
freq.items.df <- data.frame(item_set = labels(freq.items)
, support = freq.items@quality)
head(freq.items.df, 5)
tail(freq.items.df, 5)
# Let us now examine the rules
confidence <- 0.4 # Interest Measure
parameters = list(
support = support,
confidence = confidence,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "rules"
)
rules <- apriori(transactions.obj, parameter = parameters)
rules.df <- data.frame(rules = labels(rules)
,rules@quality)
interestMeasure(rules, transactions = transactions.obj)
library(ggplot2)
library(arules)
head(no.rules.df)
In the preceding code, we show how we can wrap what we have discussed so far into some
functions to explore the support and confidence values. The get.rules function
conveniently wraps the rule generation functionality. For a given support threshold,
confidence threshold, and transactions object, this function returns the rules induced by
the association rule mining algorithm. The explore.parameters function creates a grid
of support and confidence values. You can see that the support values are in the range of
0.001 to 0.1, that is, we are exploring the space from items present in 0.1 percent of the
transactions up to 10 percent of the transactions. Finally, the function evaluates the number
of rules generated for each support/confidence pair in the grid and wraps the results neatly
in a data frame. For very low support values, we see an extremely large number of rules
being generated; most of them would be spurious rules:
support confidence no.rules
1 0.001 0.05 23024
2 0.002 0.05 4788
3 0.003 0.05 2040
Analysts can take a look at this data frame, or alternatively plot it to decide the right
minimum support/confidence threshold. Our get.plots function plots a number of graphs
for different values of confidence. Here is the line plot of the number of rules generated for
various support values, keeping the confidence fixed at 0.1:
The preceding plot can be a good guideline for selecting the support value. We generated
the preceding plot by fixing the confidence at 10 percent. You can experiment with different
values of confidence. Alternatively, use the get.plots function.
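The definitions of get.rules and explore.parameters themselves are not shown above, so the following is only a minimal sketch of how such a grid search could look, not the book's exact implementation:

# Sketch: count the rules induced for a grid of support/confidence values
get.rules <- function(support, confidence, transactions) {
  parameters <- list(support = support, confidence = confidence,
                     minlen = 2, maxlen = 10, target = "rules")
  apriori(transactions, parameter = parameters)
}

explore.parameters <- function(transactions) {
  support.values    <- seq(0.001, 0.1, by = 0.001)
  confidence.values <- seq(0.05, 0.1, by = 0.05)
  grid <- expand.grid(support = support.values,
                      confidence = confidence.values)
  grid$no.rules <- apply(grid, 1, function(row) {
    length(get.rules(row["support"], row["confidence"], transactions))
  })
  grid
}

no.rules.df <- explore.parameters(transactions.obj)
head(no.rules.df)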
The paper, Mining the Most Interesting Rules, by Roberto J. Bayardo Jr. and
Rakesh Agrawal demonstrates that the most interesting rules are found on
the support/confidence border when we plot the rules with support and
confidence on the x and y axes respectively. They call these rules SC-
Optimal rules.
library(arules)
library(igraph)
rm.duplicates = FALSE,
quote = "", skip = 0,
encoding = "unknown")
return(transactions.obj)
}
return(best.rules.df)
}
plot.graph(cross.sell.rules)
After exploring the dataset for support and confidence values, we set the support and
confidence values as 0.001 and 0.2 respectively.
We have written a function called find.rules; it internally calls get.rules. This function
returns the top N rules for a given transactions object and support/confidence thresholds.
We are interested in the top 10 rules. As discussed, we are going to use the lift values for
our recommendation. The following are our top 10 rules:
   rules                                               support    confidence lift     conviction leverage
59 {Organic Hass Avocado} => {Bag of Organic Bananas}  0.03219805 0.3086420  1.900256 1.211498   0.01525399
63 {Organic Strawberries} => {Bag of Organic Bananas}  0.03577562 0.2753304
The first entry has a lift value of 1.9, indicating that the products are not independent. This
rule has a support of 3 percent and the system has 30 percent confidence for this rule. We
recommend that the retailer uses these two products in his cross-selling campaign as, given
the lift value, there is a high probability of the customer picking up a {Bag of Organic
Bananas} if he picks up an {Organic Hass Avocado}.
Notice that we have also included two other interest measures: conviction and leverage.
Leverage
How many more co-purchases of A and B can we expect than would follow from their
individual sales? With lift, we said that there is a high association between the {Bag of
Organic Bananas} and {Organic Hass Avocado} products. With leverage, we can
quantify that association in terms of additional co-occurrences. A leverage of about 0.015
means that, per 100 transactions, the retailer can expect roughly 1.5 more transactions
containing both the {Bag of Organic Bananas} and the {Organic Hass Avocado} than
if the two products sold independently of each other. For a given rule A => B:
Leverage(A => B) = Support(A => B) - Support(A)*Support(B)
Leverage measures the difference between how often A and B appear together in the
dataset and how often they would be expected to appear together if A and B were
statistically independent.
Conviction
Conviction is a measure that helps ascertain the direction of a rule. Unlike lift, conviction is
sensitive to the rule direction: conviction(A => B) is not the same as conviction(B => A). For
a rule A => B:
Conviction(A => B) = (1 - Support(B)) / (1 - Confidence(A => B))
A value greater than 1 indicates that the rule holds more often than it would if the two
sides were independent. Conviction, with its sense of direction, gives us a hint that
targeting the customers of Organic Hass Avocado to cross-sell will yield more sales of Bag
of Organic Bananas rather than the other way round.
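Both measures can be computed directly with the interestMeasure function on the rules object from earlier; the snippet below is a sketch (in recent arules versions the argument selecting the measures is named measure):

# Attach conviction and leverage to the rules data frame
extra.measures <- interestMeasure(rules,
                                  measure = c("conviction", "leverage"),
                                  transactions = transactions.obj)
rules.df <- cbind(rules.df, extra.measures)
head(rules.df)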
Thus, using lift, leverage, and conviction, we have provided all the empirical details to our
retailer to design his cross-selling campaign. In our case, we have recommended the top 10
rules to the retailer based on leverage. To provide the results more intuitively and to
indicate what items could go together in a cross-selling campaign, a graph visualization of
the rules can be very appropriate.
The plot.graph function is used to visualize the rules that we have shortlisted based on
their leverage values. It internally uses a package called igraph to create a graph
representation of the rules:
Our suggestion to the retailer would be the largest subgraph on the left; the items in that
subgraph can be leveraged for his cross-selling campaign. Depending on the profit margins
and other factors, the retailer can now design his cross-selling campaign using the
preceding output.
"Our output up until now has been great. How can I add some additional lists of products
to my campaign?" asks the retailer.
You are shocked; the data scientist in you wants to do everything empirically. Now the
retailer is asking for a hard list of products to be added to the campaign. How do you fit
them in?
The analysis output does not include these products. None of our top rules recommend
them. They are not very frequently sold items; hence, they do not bubble up in the induced
rules.
The retailer is insistent: "The products I want to add to the campaign are high margin items.
Having them in the campaign will boost my yields."
Voila! The retailer is interested in high-margin items. Let's pull another trick of the
trade—weighted association rule mining.
Jubilant, you reply, "Of course I can accommodate these items. Not just them, but also your
other high-valued items. I see that you are interested in your margins; can you give me the
margin of all your transactions? I can redo the analysis with this additional information and
provide you with the new results. Shouldn't take much time."
The retailer is happy. "Of course; here is my margin for the transactions."
Let's introduce weighted association rule mining with an example from the seminal
paper, Weighted Association Rules: Models and Algorithms by G.D.Ramkumar et al.
Caviar is an expensive item and hence has low support in any supermarket's baskets.
Vodka, on the other hand, is a high-to-medium support item. The association caviar =>
vodka has very high confidence, but it will never be derived by the existing methods, as
the {caviar, vodka} itemset has low support and will not be included.
The preceding paragraph echoes our retailer's concern. With the additional information
about the margin for our transactions, we can now use weighted association rule mining to
arrive at our new set of recommendations:
"transactionID","weight"
"1001861",0.59502283788534
"1003729",0.658379205926458
"1003831",0.635451491097042
"1003851",0.596453384749423
"1004513",0.558612727312164
"1004767",0.557096300448959
"1004795",0.693775098285732
"1004797",0.519395513963845
"1004917",0.581376662057422
library(arules)
library(igraph)
#
# Returns:
# transaction object
transactions.obj <- read.transactions(file = data.path, format = "single",
sep = ",",
cols = columns,
rm.duplicates = FALSE,
quote = "", skip = 0,
encoding = "unknown")
return(transactions.obj)
}
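Only the tail of the helper that wraps read.transactions survives in the excerpt above; the reconstruction below is a sketch in which the function name and argument names are our assumptions:

# Hedged reconstruction of the helper wrapping read.transactions
get.txn <- function(data.path, columns) {
  # Reads a "single"-format file and returns an arules transactions object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}

# Example usage (path and column names are illustrative)
# transactions.obj <- get.txn("../../data/transactions.csv",
#                             c("order_id", "product_id"))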
parameters = list(
support = support,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
head(weclat.itemsets.df)
tail(weclat.itemsets.df)
# Rule induction
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj,
confidence = 0.3)
weclat.rules.df <-data.frame(rules = labels(weclat.rules)
, weclat.rules@quality)
head(weclat.rules.df)
In the arules package, the weclat method allows us to use weighted transactions to
generate frequent itemsets based on these weights. We introduce the weights through the
itemsetInfo data frame slot of the transactions object, which we can inspect with
str(transactions.obj):
The third slot in the transaction object is a data frame with one column, transactionID.
We create a new column called weight in that data frame and push our transaction
weights:
weights <- read.csv("../../data/weights.csv")
transactions.obj@itemsetInfo <- weights
str(transactions.obj)
In the preceding case, we have replaced the whole data frame. You can either do that or
only add the weight column.
We have the transactionID and the weight in the itemsetInfo data frame now. Let's
run the weighted itemset generation using these transaction weights:
support <- 0.01
parameters = list(
support = support,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "frequent itemsets"
)
Once again, we invoke the weclat function with the parameter list and the transactions
object. As the itemsetInfo data frame now has the weight column, the function
calculates the support using the weights provided. The new definition of support is as
follows:
The weighted support of an itemset is the sum of the weights of the transactions that
contain the itemset. An itemset is frequent if its weighted support is equal to or greater
than the threshold specified by support (assuming that the weights sum to one).
With this new definition, you can see that items with low support under the old definition
will now be included if they appear in high-value transactions. We have automatically
taken care of our retailer's request to include high-margin items while inducing the rules.
Once again, for better readability, we create a data frame in which each row is a frequent
itemset, with a column indicating its support value:
head(weclat.itemsets.df)
weclat.itemsets support
1 {Bag of Organic Bananas,Organic Kiwi} 0.01041131
2 {Bag of Organic Bananas,Organic D'Anjou Pears} 0.01042194
3 {Bag of Organic Bananas,Organic Whole String Cheese} 0.01034432
4 {Organic Baby Spinach,Organic Small Bunch Celery} 0.01039107
5 {Bag of Organic Bananas,Organic Small Bunch Celery} 0.01109455
6 {Banana,Seedless Red Grapes} 0.01274448
tail(weclat.itemsets.df)
weclat.itemsets support
77 {Banana,Organic Hass Avocado} 0.02008700
78 {Organic Baby Spinach,Organic Strawberries} 0.02478094
79 {Bag of Organic Bananas,Organic Baby Spinach} 0.02743582
80 {Banana,Organic Baby Spinach} 0.02967578
81 {Bag of Organic Bananas,Organic Strawberries} 0.03626149
82 {Banana,Organic Strawberries} 0.03065132
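To see how the weighted definition behaves, here is a tiny hand computation with made-up transaction weights (not the retailer's margins):

# Three hypothetical transactions with weights that sum to one
txn.weights <- c(T1 = 0.6, T2 = 0.3, T3 = 0.1)

# Suppose the itemset {A, B} appears only in T2 and T3
weighted.support <- sum(txn.weights[c("T2", "T3")])  # 0.3 + 0.1 = 0.4

# Under the classical definition the same itemset would score 2/3
classical.support <- 2 / 3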
In the case of apriori, we used the same function to generate/induce the rules. However,
in the case of weighted association rule mining, we need to call the ruleInduction
function to generate rules. We pass the frequent itemsets from the previous step, the
transactions object, and finally the confidence threshold. Once again, for our convenience,
we create a data frame with the list of all the rules that are induced and their interest
measures:
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence
= 0.3)
weclat.rules.df <-data.frame(weclat.rules = labels(weclat.rules)
, weclat.rules@quality)
head(weclat.rules.df)
Finally, let's use the plot.graph function to view the new set of interesting item
associations:
Our new recommendation now includes some of the rare items. It is also sensitive to the
profit margin of individual transactions. With these recommendations, the retailer is geared
toward increasing his profitability through the cross-selling campaign.
The arules package provides a method called hits to help us do exactly that: infer
transaction weights. HITS stands for Hyperlink-Induced Topic Search, a link analysis
algorithm for rating web pages developed by Jon Kleinberg. According to HITS, hubs are
pages with a large out-degree and authorities are pages with a large in-degree.
The rationale is that if a lot of pages link to a particular page, then that page is considered
an authority. If a page has links to a lot of other pages, it is considered a hub.
The basic idea behind using the HITS algorithm for association rule mining is that frequent
items may not be as important as they seem. The paper Mining Weighted Association Rules
without Preassigned Weights by Ke Sun and Fengshan Bai presents an algorithm that applies
the HITS methodology to bipartite graphs.
In HITS as modified for a transaction database, the transactions and products are treated
as a bipartite graph, with an arc going from a transaction to a product if the product is
present in that transaction. The following diagram, reproduced from the same paper,
illustrates how transactions in a database are converted to a bipartite graph:
In this representation, the transactions can be ranked using the HITS algorithm. In this
kind of representation, the support of an item is proportional to its in-degree. Consider
item A: its absolute support is 4, and in the graph the in-degree of A is four. As you can
see, when considering only support, we totally ignore transaction importance unless the
importance of the transactions is explicitly provided. How do we get the transaction
weights in this scenario? Intuitively, a good transaction should be highly weighted and
should contain many good items, and a good item should be contained in many good
transactions. By treating the transactions as hubs and the products as authorities, the
algorithm invokes HITS on this bipartite graph.
The arules package provides the hits method, which implements the algorithm that we
described earlier:
weights.vector <- hits( transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector), weight =
weights.vector)
We invoke the algorithm using the hits method. We have described the intuition behind
using the HITS algorithm on transaction databases to assign weights to the transactions; let
us now briefly describe how the HITS algorithm itself works. Curious readers can refer
to Authoritative Sources in a Hyperlinked Environment, J. M. Kleinberg, J. ACM, vol. 46, no. 5,
pp. 604-632, 1999, for a better understanding of the HITS algorithm.
To begin with, the HITS algorithm initializes the weights of all nodes to one; in our case,
both the items and the transactions start at one. That is, we maintain two arrays, one for
the hub weights and the other for the authority weights.
It then performs the following three steps iteratively, until the hub and authority arrays
stabilize, that is, they no longer change with subsequent updates (a small sketch of this
iteration follows the list):
1. Authority score update: set the authority score of each node to the sum of the
hub scores of the nodes that point to it.
2. Hub score update: set the hub score of each node to the sum of the authority
scores of the nodes that it points to.
3. Unit normalize the hub and authority scores, then repeat from step 1.
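The following toy sketch mimics this iteration on a small, made-up transaction-item incidence matrix; it is only an illustration of the update rules, not the implementation used by arules, which works on the transactions object directly:

# Rows are transactions (hubs), columns are items (authorities); data is made up
M <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("T1", "T2", "T3"), c("A", "B", "C")))

hub  <- rep(1, nrow(M))   # transaction (hub) scores
auth <- rep(1, ncol(M))   # item (authority) scores

for (i in 1:50) {
  auth <- as.vector(t(M) %*% hub)    # authority update: sum of incoming hub scores
  hub  <- as.vector(M %*% auth)      # hub update: sum of authority scores pointed to
  auth <- auth / sqrt(sum(auth^2))   # unit-normalize both score vectors
  hub  <- hub / sqrt(sum(hub^2))
}

round(hub, 3)    # transaction weights, analogous to hits(transactions.obj)
round(auth, 3)   # item authority scores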
At the end of the algorithm, every item has an authority score, which is the sum of the hub
scores of all the transactions that contain the item, and every transaction has a hub score,
which is the sum of the authority scores of all the items in that transaction. Using the
weights created by the HITS algorithm, we build the weights.df data frame:
head(weights.df)
transactionID weight
1000431 1000431 1.893931e-04
100082 100082 1.409053e-04
1000928 1000928 2.608214e-05
1001517 1001517 1.735461e-04
1001650 1001650 1.184581e-04
1001934 1001934 2.310465e-04
We assign weights.df to the itemsetInfo slot of our transactions object, just as we did
earlier with the retailer-provided weights, and can now generate the weighted association
rules:
transactions.obj@itemsetInfo <- weights.df
parameters = list(
    support = support,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <-data.frame(weclat.itemsets = labels(weclat.itemsets)
, weclat.itemsets@quality)
We can look into the output data frames created for frequent itemsets and
rules:
head(weclat.itemsets.df)
weclat.itemsets support
1 {Banana,Russet Potato} 0.01074109
2 {Banana,Total 0% Nonfat Greek Yogurt} 0.01198206
3 {100% Raw Coconut Water,Bag of Organic Bananas} 0.01024201
4 {Organic Roasted Turkey Breast,Organic Strawberries} 0.01124278
5 {Banana,Roma Tomato} 0.01089124
6 {Banana,Bartlett Pears} 0.01345293
tail(weclat.itemsets.df)
weclat.itemsets support
540 {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries}
0.02142840
541 {Banana,Organic Baby Spinach,Organic Strawberries} 0.02446832
542 {Bag of Organic Bananas,Organic Baby Spinach} 0.06536606
543 {Banana,Organic Baby Spinach} 0.07685530
544 {Bag of Organic Bananas,Organic Strawberries} 0.08640422
545 {Banana,Organic Strawberries} 0.08226264
head(weclat.rules.df)
Based on the weights generated using the HITS algorithm, we can now order the items by
their authority score. This is an alternate way of ranking the items in addition to ranking
them by their frequency. We can leverage the itemFrequency function to generate the item
scores:
freq.weights <- head(sort(itemFrequency(transactions.obj, weighted =
TRUE),decreasing = TRUE),20)
freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted =
FALSE),decreasing = TRUE),20)
The score column gives the relative transaction frequency, and the score.new column is
the authority score from the hits algorithm. You can see that Limes and Large Lemon
have swapped places, and Strawberries has moved further up the order compared to the
original transaction frequency score.
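The code that builds this comparison does not survive in the excerpt; one way to construct such a data frame from the two itemFrequency calls above is sketched here, with the column names score and score.new chosen to match the discussion:

# Compare the unweighted ranking with the authority-based (weighted) ranking
weighted.scores   <- itemFrequency(transactions.obj, weighted = TRUE)
unweighted.scores <- itemFrequency(transactions.obj, weighted = FALSE)

top.items  <- names(head(sort(unweighted.scores, decreasing = TRUE), 20))
compare.df <- data.frame(item      = top.items,
                         score     = unweighted.scores[top.items],
                         score.new = weighted.scores[top.items])
compare.df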
library(arules)
weights.vector)
head(weights.df)
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <-data.frame(weclat.itemsets = labels(weclat.itemsets)
, weclat.itemsets@quality)
head(weclat.itemsets.df)
tail(weclat.itemsets.df)
## Rule induction
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj,
                              confidence = 0.3)
weclat.rules.df <- data.frame(weclat.rules = labels(weclat.rules),
                              weclat.rules@quality)
head(weclat.rules.df)
There are several algorithms proposed to induce negative association rules from a
transaction database. Apriori is not well suited to negative association rule mining: in order
to use apriori, every transaction needs to be updated with all the items, both those present
in the transaction and those absent from it. This would heavily inflate the database; in our
case, every transaction would carry 16,000 items. We can cheat, though, and leverage
apriori for negative association rule mining for only a selected list of items. In transactions
that do not contain a given item, we create an entry for that item to indicate that it is absent
from the transaction. The addComplement function from the arules package allows us to do
exactly that.
Consider three transactions: {Banana, Strawberries}, {Onions, ginger, garlic}, and
{Milk, Banana}. When we pass these transactions to addComplement and say that we want
non-Banana entries to be added, the resulting transactions from addComplement will be as
follows:
Banana, Strawberries
Onions, ginger, garlic, !Banana
Milk, Banana
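A minimal sketch of this call on the toy transactions above (the toy list is built here only for
illustration; the complement_labels argument lets you choose your own prefix instead of
the default exclamation mark):
library(arules)
toy <- list(c("Banana", "Strawberries"),
            c("Onions", "ginger", "garlic"),
            c("Milk", "Banana"))
toy.trans <- as(toy, "transactions")
neg.trans <- addComplement(toy.trans, labels = "Banana")   # adds !Banana where Banana is absent
inspect(neg.trans)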
An exclamation mark is the standard way to indicate the absence; however, you can choose
your own prefix:
get.neg.rules <- function(transactions, itemList, support, confidence){
  # Generate negative association rules for the given support and confidence values
  neg.transactions <- addComplement(transactions, labels = itemList)
  rules <- get.rules(support, confidence, neg.transactions)
  return(rules)
}
In the preceding code, we have created the get.neg.rules function. Inside this function, we
leverage addComplement to introduce absence entries for the items in itemList into the
transactions. We then generate the rules from the newly formed transactions,
neg.transactions:
itemList <- c("Organic Whole Milk", "Cucumber Kirby")
neg.rules <- get.neg.rules(transactions.obj, itemList, .05, .6)
Once we have the negative rules, we pass those through is.redundant to remove any
redundant rules and finally print the rules:
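A minimal sketch of that step, assuming get.neg.rules returns an arules rules object
(is.redundant is provided by arules; the neg.rules.nr name matches the labels() call in
the listing below):
neg.rules.nr <- neg.rules[!is.redundant(neg.rules)]
labels(neg.rules.nr)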
#
# Script:
#
# RScript to explain negative associative rule mining
#
# Gopi Subramanian
#########################################################################
library(arules)
library(igraph)
}
get.neg.rules <- function(transactions, itemList, support, confidence){
  # Generate negative association rules for given support confidence value
  #
  # Args:
  #  transactions: Transaction object, list of transactions
  #  itemList : list of items to be negated in the transactions
  #  support: Minimum support threshold
  #  confidence: Minimum confidence threshold
  # Returns:
  #  A data frame with the best set of negative rules and their support and
  #  confidence values
  neg.transactions <- addComplement(transactions, labels = itemList)
  rules <- get.rules(support, confidence, neg.transactions)
  return(rules)
}
labels(neg.rules.nr)
Rules visualization
In the previous sections, we leveraged the plotting capabilities of the arules and igraph
packages to plot induced rules. In this section, we introduce arulesViz, a package
dedicated to plotting association rules generated by the arules package. The arulesViz
package integrates seamlessly with arules in terms of sharing data structures.
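As a minimal sketch of what arulesViz offers, the following calls draw a scatter plot and a
graph plot; they assume a rules object such as the all.rules object produced by get.rules
later in this listing:
library(arulesViz)
plot(all.rules, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift")
plot(head(sort(all.rules, by = "lift"), 20), method = "graph")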
library(arules)
library(arulesViz)
get.rules <- function(support, confidence, transactions){
  # Get apriori rules for the given support and confidence thresholds
  #
  # Returns:
  #  rules object
  parameters = list(
    support = support,
    confidence = confidence,
    minlen = 2,  # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "rules"
  )
  rules <- apriori(transactions, parameter = parameters)
  return(rules)
}

# Induce Rules
all.rules <- get.rules(support, confidence, transactions.obj)
Wrapping up
The final step in any data analysis project is documentation: either generating a report of
the findings or documenting the scripts and data used. In our case, we are going to wrap up
with a small application. We will use R Shiny, a powerful framework for developing
interactive web applications with R. We will leverage the code that we have written to
generate a simple, yet powerful, user interface for our retail customers.
To keep things simple, we have a set of three screens. The first screen allows the user to
vary the support and confidence thresholds and view the rules generated. It also shows
additional interest measures: lift, conviction, and leverage. The user can sort the rules by
any of these interest measures.
Finally, the third screen shows a graph representation of the product groupings, for easy
selection of products for the cross-selling campaign. The code for the application follows:
library(shiny)
library(plotly)
library(arules)
library(igraph)
library(arulesViz)
get.txn <- function(data.path, columns){
  # Get the transaction object for a given data file
  #
  # Returns:
  #  transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}
return(best.rules.df)
}
})
})
ui <- fluidPage(
  headerPanel(title = "X-Sell Recommendations"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("Support", "Support threshold:", min = 0.01, max = 1.0,
                  value = 0.01),
      sliderInput("Confidence", "Confidence threshold:", min = 0.05, max = 1.0,
                  value = 0.05)
    ),
    mainPanel(
      tabsetPanel(
        id = 'xsell',
        tabPanel('Rules', DT::dataTableOutput('rulesTable')),
        tabPanel('Explore', plotOutput('explorePlot')),
        tabPanel('Item Groups', plotOutput('graphPlot'))
      )
    )
  )
)
We have described the get.txn, get.rules, and find.rules functions in the previous
section, so we will not go through them again here. The preceding code is a single-page
RShiny app; both the server and the UI components reside in the same file.
We define the screen layout in the ui section, which can also be kept in a separate file
called UI.R. The page is defined by two sections: a panel on the left, defined by
sidebarPanel, and a main section defined under mainPanel. Inside the sidebar, we have
defined two slider controls, for the support and confidence thresholds respectively. The
main panel contains a tabbed window, defined by tabsetPanel.
The main panel has three tabs: a slot for the final set of rules with their interest measures, a
scatter plot of the rules, and finally the graph plot of the rules.
The cross.sell.rules data frame is defined as a reactive component. When the values of
the support and confidence thresholds change in the UI, the cross.sell.rules data frame
is recomputed. This frame is served to the first tab, where we have defined a slot for the
table, called rulesTable. The gen.rules reactive component shown next works in a
similar fashion:
gen.rules <- reactive({
  support <- input$Support
  confidence <- input$Confidence
  gen.rules <- get.rules(support, confidence, transactions.obj)
  return(gen.rules)
})
This reactive component performs the calculations and returns the rules object every time
the support and/or confidence threshold is changed by the user in the UI:
output$rulesTable <- DT::renderDataTable({
  cross.sell.rules()
})
The preceding code renders the data frame back to the UI:
output$graphPlot <- renderPlot({
  g <- plot.graph(cross.sell.rules())
  plot(g)
})
The preceding code renders the graph plot back to the UI.
Summary
We started the chapter with an overview of recommender systems. We introduced our
retail case and the association rule mining algorithm. Then we applied association rule
mining to design a cross-selling campaign. We went on to understand weighted association
rule mining and its applications. Following that, we introduced the HITS algorithm and its
use in transaction data. Next, we studied the negative association rules discovery process
and its use. We showed you different ways to visualize association rules. Finally, we
created a small web application using R Shiny to demonstrate some of the concepts we
learned.
In the next chapter, we will look at another recommendation system algorithm, called
content-based filtering. We will see how this method can help address the famous cold
start problem in recommendation systems. Furthermore, we will introduce the concept of
fuzzy ranking to order the final recommendations.
2
Fuzzy Logic Induced Content-Based Recommendation
When a friend comes to you for a movie recommendation, you don't arbitrarily start
rattling off movie names; you try to suggest movies keeping your friend's tastes in mind.
Content-based recommendation systems try to mimic this exact process.
Consider a scenario in which a user is browsing through a list of products. Given a set of
products and the associated product properties, when a person views a particular product,
content-based recommendation systems can generate a subset of products with similar
properties to the one currently being viewed by the user. Typical content-based
recommendation systems tend to also include the user profile. In this chapter, however, we
will not be including the user profiles. We will be working solely with the item/product
profiles. Content-based recommendation systems are also called content-based filtering
methods.
During winter, cars tend to have problems starting, but once the engine reaches its optimal
temperature it runs smoothly. Recommender systems have the same cold start problem.
When deployed for the first time, a recommender system is unaware of the user
preferences; it is in the dark when it comes to recommendations. Recommender systems
that rely on user preferences, such as the collaborative filtering methods we will cover in
the next chapter, become ineffective in the absence of those preferences.
Content-based recommendation is the best method to address the cold start problem. Since
content-based methods rely on the product properties to create recommendations, they can
ignore the user preferences to begin with. While the content-based method dishes out the
needed recommendations, a user profile can be built in the background. With a sufficient
user profile, the content-based method can be further improved, or we can move on to
collaborative filtering methods.
A news aggregator collects syndicated web content such as news articles, blogs, videos, and
similar items at a centralized location for easy viewing. We will use an imaginary news
aggregator website as an example to explain and build a content-based recommendation
system. Our use case is as follows: when a person is reading a particular news article, we
want to recommend other news articles that might interest him. This will help us
understand how the cold start problem in recommender systems is addressed using
content-based methods. Furthermore, our fuzzy logic rule system will shed light on how a
fuzzy system can be leveraged for ranking.
The code for this chapter was written in RStudio version 0.99.491. It uses R version 3.3.1. As
we work through our examples, we will introduce the R packages that we will be using.
During our code description, we will be using some of the output printed in the console. We
have included what will be printed in the console immediately following the statement
which prints the information to the console, so as to not disturb the flow of the code.
This dataset is the result of a chemical analysis of wines grown in the same region in Italy.
We have data from three different cultivars (a cultivar is an assemblage of plants selected
for desirable characters; Wikipedia: https://en.wikipedia.org/wiki/Cultivar).
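The fread call that loads this dataset appears in the complete code listing at the end of the
chapter; it is repeated here so the snippets below can be followed on their own:
library(data.table)
wine.data <-
  fread('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data')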
> head(wine.data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1: 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
2: 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
3: 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
4: 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
5: 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
6: 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
> table(wine.data$V1)
 1  2  3
59 71 48
>
Let's remove the cultivar and only retain the chemical properties of the wine:
> wine.type <- wine.data[,1]
> wine.features <- wine.data[,-1]
wine.features has all the properties, without the cultivar column.
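The scaling step that produces wine.features.scaled and wine.mat sits on a dropped
page; the following is a minimal sketch of what is assumed, consistent with the
standardized values shown in the output below:
wine.features.scaled <- scale(wine.features)   # zero mean, unit variance per column
wine.mat <- data.matrix(wine.features.scaled)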
Let's add the row names and give an integer number for each wine:
> rownames(wine.mat) <- seq(1:dim(wine.features.scaled)[1])
> wine.mat[1:2,]
V2 V3 V4 V5 V6 V7 V8
V9 V10 V11
1 1.5143408 -0.5606682 0.2313998 -1.166303 1.90852151 0.8067217 1.0319081
-0.6577078 1.2214385 0.2510088
2 0.2455968 -0.4980086 -0.8256672 -2.483841 0.01809398 0.5670481 0.7315653
-0.8184106 -0.5431887 -0.2924962
V12 V13 V14
1 0.3611585 1.842721 1.0101594
2 0.4049085 1.110317 0.9625263
>
With our matrix ready, let's find the similarity between the wines. We have numbered our
rows to represent the wines; the columns represent the properties of the wine.
We will use the Pearson correlation coefficient as our similarity measure: for two vectors x
and y, it is the covariance cov(x, y) divided by the product of the standard deviations of x
and y.
In our case, we want to find the Pearson coefficient between the rows, that is, the similarity
between two wines. Hence we will transpose our matrix before invoking the cor function.
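A minimal sketch of that step (cor works column-wise, so the transpose makes it compare
wines):
cor.matrix <- cor(t(wine.mat), method = "pearson")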
> dim(cor.matrix)
[1] 178 178
> cor.matrix[1:5,1:5]
1 2 3 4 5
1 1.0000000 0.7494842 0.5066551 0.7244043066 0.1850897291
2 0.7494842 1.0000000 0.4041662 0.6896539740 -0.1066822182
3 0.5066551 0.4041662 1.0000000 0.5985843958 0.1520360593
4 0.7244043 0.6896540 0.5985844 1.0000000000 -0.0003942683
5 0.1850897 -0.1066822 0.1520361 -0.0003942683 1.0000000000
We transpose the wine.mat matrix and pass it to the cor function; the output is the
similarity between the different wines.
The cor.matrix matrix is the similarity matrix, which shows how closely related the items
are. The values range from -1, for perfect negative correlation, when two items have
attributes that move in opposite directions, to +1, for perfect positive correlation, when the
attributes of the two items move in the same direction. For example, in row 1, wine 1 is
more similar to wine 2 than to wine 3. The diagonal values will be +1, as we are comparing
a wine to itself.
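The user.view object printed below is assumed to be the profile of the wine the user is
currently looking at; a minimal sketch of its definition:
user.view <- wine.mat[3, , drop = FALSE]   # wine 3, keeping the matrix shape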
> user.view
V2 V3 V4 V5 V6 V7 V8
V9 V10 V11
3 0.1963252 0.02117152 1.106214 -0.2679823 0.08810981 0.8067217 1.212114
-0.497005 2.129959 0.2682629
V12 V13 V14
3 0.3174085 0.7863692 1.391224
Let's say a particular user is either tasting or looking at the properties of wine 3. We want
to recommend him wines similar to wine 3.
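A minimal sketch of how sim.items is assumed to be obtained, in line with the explanation
that follows the output:
sim.items <- cor.matrix[3, ]   # similarity of wine 3 to every other wine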
> sim.items
1 2 3 4 5 6
7 8 9
0.50665507 0.40416617 1.00000000 0.59858440 0.15203606 0.54025182
0.57579895 0.18210803 0.42398729
10 11 12 13 14 15
16 17 18
0.55472235 0.66895949 0.40555308 0.61365843 0.57899194 0.73254986
0.36166695 0.44423273 0.28583467
19 20 21 22 23 24
25 26 27
0.49034236 0.44071794 0.37793495 0.45685238 0.48065399 0.52503055
0.41103595 0.04497370 0.56095748
28 29 30 31 32 33
34 35 36
0.38265553 0.36399501 0.53896624 0.70081585 0.61082768 0.37118102
-0.08388356 0.41537403 0.57819928
37 38 39 40 41 42
43 44 45
0.33457904 0.50516170 0.34839907 0.34398394 0.52878458 0.17497055
0.63598084 0.10647749 0.54740222
46 47 48 49 50 51
52 53 54
-0.02744663 0.48876356 0.59627672 0.68698418 0.48261764 0.76062564
0.77192733 0.50767052 0.41555689.....
We look at the third row in our similarity matrix. We know that the similarity matrix has
stored all the item similarities. So the third row gives us the similarity score between wine 3
and all the other wines. The preceding results are truncated.
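A minimal sketch of the sorting step described below:
sim.items.sorted <- sort(sim.items, decreasing = TRUE)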
> sim.items.sorted[1:5]
3 52 51 85 15
1.0000000 0.7719273 0.7606256 0.7475886 0.7325499
>
First, we sort row 3 in decreasing order, so that the items closest to wine 3 come to the
front. Then we pull out the top five matches. Great--we want to recommend wines 52, 51,
85, and 15 to this user. We ignore the first recommendation, as it is the same item we are
searching for: the first element is wine 3 itself, with a similarity score of 1.0.
Let's look at the properties of wine 3 and the top five matches to confirm our
recommendation:
> rbind(wine.data[3,]
+ ,wine.data[52,]
+ ,wine.data[51,]
+ ,wine.data[85,]
+ ,wine.data[15,]
+ )
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1: 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
2: 1 13.83 1.65 2.60 17.2 94 2.45 2.99 0.22 2.29 5.60 1.24 3.37 1265
3: 1 13.05 1.73 2.04 12.4 92 2.72 3.27 0.17 2.91 7.20 1.12 2.91 1150
4: 2 11.84 0.89 2.58 18.0 94 2.20 2.21 0.22 2.35 3.05 0.79 3.08 520
5: 1 14.38 1.87 2.38 12.0 102 3.30 3.64 0.29 2.96 7.50 1.20 3.00 1547
>
Great—you can see that the wine properties in our recommendation are close to the
properties of wine 3.
We use the cnames vector to define these headings as we read the file using the read_tsv
function. Further inside read_tsv, while defining the column types, we also specify the
variable type for each of these columns.
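The load itself sits on a dropped page; the following is a minimal sketch, assuming the UCI
news aggregator file newsCorpora.csv (the file name and the exact col_types
specification are assumptions), with the column names matching the data.subset output
later in this chapter. The grouped summarise call, which also appears in the complete
listing, produces the publisher tibble below:
library(readr)
library(dplyr)
cnames <- c("ID", "TITLE", "URL", "PUBLISHER", "CATEGORY",
            "STORY", "HOSTNAME", "TIMESTAMP")
data <- read_tsv("newsCorpora.csv", col_names = cnames,
                 col_types = cols(ID = col_integer(), TITLE = col_character(),
                                  URL = col_character(), PUBLISHER = col_character(),
                                  CATEGORY = col_character(), STORY = col_character(),
                                  HOSTNAME = col_character(), TIMESTAMP = col_double()))
data %>% group_by(PUBLISHER) %>% summarise()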
# A tibble: 2,991 x 1
PUBLISHER
<chr>
1 1011now
2 10News
3 10TV
4 123Jump.com
5 12NewsNow.Com
6 13WHAM-TV
7 13abc Action News
8 14 News WFIE Evansville
9 "24\\/7 Wall St."
10 "2DayFM \\(blog\\)"
# ... with 2,981 more rows
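The publisher.count data frame is assumed to be built by counting articles per publisher;
a minimal dplyr sketch (the ct column name matches the output below):
publisher.count <- data %>%
  group_by(PUBLISHER) %>%
  summarise(ct = n()) %>%
  as.data.frame()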
> head(publisher.count)
PUBLISHER ct
1 1011now 1
2 10News 4
3 10TV 2
4 123Jump.com 1
5 12NewsNow.Com 3
6 13WHAM-TV 3
> dim(publisher.count)
[1] 2991 2
We first find the number of articles under each publisher. It looks like a lot of publishers
have very few articles. Let's validate this by counting the publishers with fewer than ten
articles. We can see that 2,820 publishers, out of 2,991, have fewer than ten articles.
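A one-line sketch of that check:
sum(publisher.count$ct < 10)   # number of publishers with fewer than ten articles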
Let's get the top 100 publishers by looking at the number of articles they have published:
> publisher.top <- head(publisher.count[order(-publisher.count$ct),],100)
> head(publisher.top)
PUBLISHER ct
1937 Reuters 90
309 Businessweek 58
1548 NASDAQ 49
495 Contactmusic.com 48
540 Daily Mail 47
882 GlobalPost 47
>
We can see that Reuters tops the list. We retain only the articles from the top 100 publishers
for our exercise. The data frame publisher.top holds these top 100 publishers.
For our top 100 publishers, let's now get their articles and other information:
> data.subset <- inner_join(publisher.top, data)
Joining, by = "PUBLISHER"
> head(data.subset)
PUBLISHER ct ID
TITLE
1 Reuters 90 38081 PRECIOUS-Gold ticks lower, US dollar holds
near peak
2 Reuters 90 306465 UKs FTSE rallies as Rolls-Royce races
higher
3 Reuters 90 371436 US economic growth to continue at modest pace - Feds
Lacker
4 Reuters 90 410152 Traders pare bets on earlier 2015 Fed
rate hike
5 Reuters 90 180407 FOREX-Dollar slides broadly, bullish data
helps euro
6 Reuters 90 311113 Fitch Publishes Sector Credit Factors for Japanese
Insurers
URL
1
http://in.reuters.com/article/2014/03/24/markets-precious-idINL4N0ML03U2014
0324
2
http://www.reuters.com/article/2014/06/19/markets-britain-stocks-idUSL6N0P0
1DM20140619
3
http://in.reuters.com/article/2014/07/08/usa-fed-idINW1N0OF00M20140708
4
http://www.reuters.com/article/2014/08/01/us-usa-fed-futures-idUSKBN0G144U2
0140801
5
http://in.reuters.com/article/2014/05/06/markets-forex-idINL6N0NS25P2014050
6
6
http://in.reuters.com/article/2014/06/24/fitch-publishes-sector-credit-fact
ors-fo-idINFit69752320140624
CATEGORY STORY HOSTNAME TIMESTAMP
1 b df099bV_5_nKjKMqxhiVh1yCmHe3M in.reuters.com 1.395753e+12
2 b dShvKWlyRq_Z3pM1C1lhuwYEY5MvM www.reuters.com 1.403197e+12
3 b dNJB5f4GzH0jTlMeEyWcKVpMod5UM in.reuters.com 1.404897e+12
4 b dunL-T5pNDVbTpMZnZ-3oAUKlKybM www.reuters.com 1.406926e+12
5 b d8DabtTlhPalvyMKxQ7tSGkTnN_9M in.reuters.com 1.399369e+12
6 b d3tIMfB2mg-9MZM4G_jGTEiRVl3jM in.reuters.com 1.403634e+12
> dim(data.subset)
[1] 2638 9
>
We join our top-100-publishers data frame publisher.top with data to get all the details
for these publishers. Our data.subset has a total of 2,638 articles.
Having looked at our data, let's now move on to designing our recommendation engine.
Let's quickly recap how a content-based recommendation engine works. When a user is
browsing a product or item, we need to provide recommendations to the user in the form of
other products or items from our catalog. We can use the properties of the items to come up
with the recommendations. Let's translate this to our use case.
So when a user is browsing a particular news article, we need to give him other news
articles as recommendations, based on the properties of the article: the words in its title, its
publisher, and its category.
We are also going to introduce another feature, calculated from the text field:
Polarity of the document. Subjective text tends to have an opinion about a topic.
The opinion can be positive, negative, or neutral. A polarity score is a real
number which captures these opinions. Polarity identification algorithms use text
mining to get the document opinion. We are going to use one such algorithm to
get the polarity of our texts.
For the wine example in the last section, we used the Pearson coefficient as a similarity
measure. Unlike the wine example, we need multiple similarity measures for this use case:
cosine similarity for the article text, Jaccard's distance for the publisher and category, and
the Manhattan distance for the polarity scores. We will explain these distance measures as
we program them in R. Hopefully this gives you a good overview of our problem
statement. Let's move on to the design of our content-based recommendation engine.
We are going to design our content-based recommendation engine in three steps:
In step 1, we will create a similarity index. Think of this index as a matrix, with rows and
columns as the articles and the cell value storing the similarity between the articles. The
diagonal values of this matrix will be one. The similarity score is a value between zero and
one. A cell value of 1.0 indicates that the two articles are an exact replica of each other.
Look at the first row; article 1 is closer to article N when compared to article 2. In our use
case, the cell value will be the cosine similarity between two documents.
In step 2, we have a simple search engine. Given an article, this engine first retrieves the
top N articles, in our case the top 30, which are closest to the given article based on the
similarity matrix developed in the previous step. Say we are searching for article number 2;
then row 2 of this matrix is accessed. After sorting the contents of the row, the top 30 are
returned as the match. For those 30 articles, we further calculate an additional feature: the
polarity of the articles. We then calculate the Manhattan distance between the polarity
value of the given article and that of all the other articles in our search results. We also find
the Jaccard's distance between the article we are searching for and all the other articles in
the search list, based on the publisher and category.
In step 3, we implement a fuzzy ranking engine. Using the similarity score from step 1,
Jaccard's score, and the polarity scores, we use a fuzzy engine to rank the top 30 matches.
The results are presented to the user in this ranked order.
Bag-of-words
In the bag-of-words representation, every document is represented as the collection of
words present in the document, hence the name bag-of-words. The order in which the
words occur is not considered in the bag-of-words approach. A good way to organize these
bags of words is a matrix representation. Let's say we have 100 documents (also called a
corpus). To build such a matrix, we first make a list of all the unique words present in those
100 documents; this is called the vocabulary of our text corpus. Say we have 5,000 unique
words. Our matrix is then of dimension 100 x 5,000, where the rows are the documents and
the columns are the words from our vocabulary. Such a matrix is called a document term
matrix.
We have represented each sentence as a vector here. We can now leverage the vector space
model to find the similarity between these two sentences. Now let's look at the cell values of
this matrix.
The example matrix is a binary matrix: a cell value of 1 indicates the presence of a word in
a document and 0 indicates its absence. We can also use other weighting schemes; let's look
at some of them.
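A tiny document term matrix can be built with the tm package; the two sentences here are
made-up stand-ins for the book's example, chosen so that tf(dog, sentence 1) = 1 and
df(dog) = 2 as used below:
library(tm)
toy.corpus <- Corpus(VectorSource(c("dog chased cat",   # document 1
                                    "dog slept")))      # document 2
toy.dtm <- DocumentTermMatrix(toy.corpus)
as.matrix(toy.dtm)   # rows: documents, columns: vocabulary, cells: term counts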
Term frequency
Term frequency, or tf(w, d), is the number of times a word w appears in the document d.
If we look at our example:
tf(dog, sentence 1) = 1
Document frequency
Document frequency or df(w) is the number of documents in the corpus in which this
word occurs. In our example, df(dog) = 2, dog appears in both the sentences.
Inverse document frequency
The inverse document frequency of a word w is idf(w) = log(N / df(w)), where N is the total
number of documents in the corpus. Rare words tend to have high IDFs. If we ignore the
log function, IDF is the total number of documents divided by the number of documents
that the word appears in. In effect, it measures how good the word is at identifying a
document. If the word is used in only a small number of documents, the denominator will
be low, and therefore the overall value will be high. The log is just used to dampen the
result.
TFIDF
This is the product of the term frequency and the inverse document frequency:
tfidf(w, d) = tf(w, d) * idf(w)
title.df stores the title and the article ID. others.df stores the article ID, publisher, and
category.
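The code that creates these two data frames sits on a dropped page; a minimal sketch of
what they are assumed to contain, based on the description above:
title.df  <- data.subset[, c("ID", "TITLE")]
others.df <- data.subset[, c("ID", "PUBLISHER", "CATEGORY")]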
We create a data frame reader using readTabular. Next, we use the Corpus function to
create our text corpus. To Corpus, we pass our title.df data frame, and
the title.reader data frame reader.
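A sketch of that corpus construction, assuming the readTabular/DataframeSource interface
of the tm version used in the book (the mapping ties the document content to TITLE and
the document id to ID):
title.reader <- readTabular(mapping = list(content = "TITLE", id = "ID"))
corpus <- Corpus(DataframeSource(title.df),
                 readerControl = list(reader = title.reader))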
Calling getTransformations shows us the list of available functions that can be used to
transform the text:
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, content_transformer(tolower))
We remove punctuation, numbers, unnecessary white space, and stop words from our
articles. Finally, we convert our text to lowercase.
Punctuation, numbers, and white space may not be good features for distinguishing one
article from another; hence, we remove them.
Stop words are present in most English text, no matter what the content is. They cannot act
as good features to distinguish our articles; hence, we remove such words.
We want our algorithms to treat the words "dog" and "Dog" the same way; hence, we bring
the text to lowercase.
Furthermore, we could apply stemming to our words. Stemming reduces the word to its
root form.
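The call that builds the document term matrix sits on a dropped page; a minimal sketch,
based on the description that follows the output (only words of 3 to 10 characters, TFIDF
cell values):
dtm <- DocumentTermMatrix(corpus,
                          control = list(wordLengths = c(3, 10),
                                         weighting = weightTfIdf))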
> inspect(dtm[1:5,10:15])
<<DocumentTermMatrix (documents: 5, terms: 6)>>
Non-/sparse entries: 0/30
Sparsity : 100%
Maximal term length: 9
Sample :
Terms
Docs abbey abbvie abc abcs abdul abenomics
180407 0 0 0 0 0 0
306465 0 0 0 0 0 0
371436 0 0 0 0 0 0
38081 0 0 0 0 0 0
410152 0 0 0 0 0 0
>
We use the DocumentTermMatrix function to create our matrix. We pass our text corpus
and also pass a list to the control parameter. Inside the list, we say that we are interested
only in words whose length (number of characters) is between 3 and 10, and that we want
the cell values of our matrix to be TFIDF weights.
We can inspect the created document term matrix by calling the inspect function.
Having created a document term matrix, let's create the cosine distance between the articles:
sim.score <- tcrossprod_simple_triplet_matrix(dtm) /
             (sqrt(row_sums(dtm^2) %*% t(row_sums(dtm^2))))
In the preceding code, we take a document term matrix and return a document-to-document
matrix, which we will call the similarity matrix from here on. The cell values of this matrix
correspond to the cosine similarity between two documents; the preceding line of code
implements this cosine similarity.
The cosine similarity between two vectors A and B of length n is given as:
cosine(A, B) = (A . B) / (||A|| * ||B||) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
Look at the numerator of this equation: it is the inner product of the two vectors. In a vector
space model, the inner product of two vectors gives the similarity between them. We divide
this inner product by the L2 norms of the individual vectors. This bounds the score between
-1 and 1, where -1 indicates that two documents are completely different from each other. If
both vectors contain only non-negative numbers, then the similarity score is bounded
between 0 and 1.
Now, in our similarity matrix, we have a row for each article, and the columns are again the
article IDs.
0.0000000 0.0000000
180407 0.0000000 0 0 0 1 0 0 0
0.0000000 0.0000000
311113 0.0000000 0 0 0 0 1 0 0
0.0000000 0.0000000
263442 0.0000000 0 0 0 0 0 1 0
0.0000000 0.0000000
171310 0.0000000 0 0 0 0 0 0 1
0.0000000 0.0000000
116144 0.1428571 0 0 0 0 0 0 0
1.0000000 0.0000000
70584 0.1543033 0 0 0 0 0 0 0
0.0000000 1.0000000
Let's look at the column for document number 38081. It is not close to most of the
documents displayed here, except for 116144 and 70584, which have small non-zero scores.
Searching
Having created our similarity matrix, we can leverage that matrix to find a match for any
given document. Let's see how to leverage this matrix to perform the search function in this
section.
We will be using the sim.score matrix created in the previous step to perform the search.
We go to the similarity matrix and pick up the row for article 38081; we call this vector
match.doc. This row holds the similarity scores between 38081 and all the other articles.
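A minimal sketch of those two steps, assuming the row and column names of sim.score
are the article IDs:
match.doc <- sim.score["38081", ]
match.df <- data.frame(ID = as.integer(names(match.doc)), cosine = match.doc)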
Our match.df data frame now contains all the matching documents for 38081 and their
cosine scores. No wonder the first row is 38081; it has to match itself perfectly.
But as we said before, we are going to recommend only the top 30 matches:
> match.refined<-head(match.df[order(-match.df$cosine),],30)
> head(match.refined)
ID cosine
38081 38081 1.0000000
38069 38069 0.3779645
231136 231136 0.2672612
334088 334088 0.2672612
276011 276011 0.2519763
394401 394401 0.2390457
>
So let's order our match.df data frame in descending order of cosine similarity and extract
the top 30 matches using the head function.
Now that we have the matching documents, we need to present them in a ranked order. In
order to rank the results, we are going to calculate some additional measures and use fuzzy
logic to get the final ranking score.
Before we go ahead and calculate the additional measures, let's merge title.df and
others.df with match.refined:
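A minimal sketch of the merge, using title.df and others.df from earlier; we re-sort by
cosine because merge does not preserve row order:
match.refined <- merge(match.refined, title.df, by = "ID")
match.refined <- merge(match.refined, others.df, by = "ID")
match.refined <- match.refined[order(-match.refined$cosine), ]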
> head(match.refined)
ID cosine
TITLE
1 38081 1.0000000 PRECIOUS-Gold ticks lower,
US dollar holds near peak
2 38069 0.3779645 PRECIOUS-Bullion drops nearly 1 pct on dollar, palladium
holds near 2-1/2-yr high
3 231136 0.2672612 Dollar steady near 3-1/2 month
lows vs. yen, Aussie weaker
4 334088 0.2672612 Canadian dollar falls amid
lower than expected GDP data
5 276011 0.2519763 Gold holds near four-month low as
ECB move on rates awaited
6 394401 0.2390457 Dollar Tree Will Buy Competitor
Family Dollar For $8.5 Billion
PUBLISHER CATEGORY
1 Reuters b
2 Reuters b
3 NASDAQ b
4 CTV News b
5 Business Standard b
6 The Inquisitr b
We have all the information and the cosine similarity in one data frame now.
Polarity scores
We are going to leverage the sentimentr R package to learn the sentiments of the articles
we have collected.
Let's look at how to score using the sentiment function:
> sentiment.score <- sentiment(match.refined$TITLE)
> head(sentiment.score)
element_id sentence_id word_count sentiment
1: 1 1 8 0.00000000
2: 2 1 11 0.00000000
3: 3 1 9 -0.13333333
4: 4 1 9 -0.08333333
5: 5 1 11 0.07537784
6: 6 1 9 0.00000000
>
The sentiment function in sentimentr calculates a score between -1 and 1 for each of the
articles. In fact, if a text has multiple sentences, it will calculate the score for each sentence.
A score of -1 indicates that the sentence has a very negative polarity. A score of 1 means that
the sentence is very positive. A score of 0 refers to the neutral nature of the sentence.
However, we need the score at an article level and not at a sentence level, so we can take an
average value of the score across all the sentences in a text.
Calculate the average value of the sentiment scores for each article:
> sentiment.score <- sentiment.score %>% group_by(element_id) %>%
+ summarise(sentiment = mean(sentiment))
> head(sentiment.score)
# A tibble: 6 x 2
element_id sentiment
<int> <dbl>
1 1 0.00000000
2 2 0.00000000
3 3 -0.13333333
4 4 -0.08333333
5 5 0.07537784
6 6 0.00000000
Here, the element_id refers to the individual article. By grouping element_id and
calculating the average, we can get the sentiment score at an article level. We now have the
scores for each article.
Let's update the match.refined data frame with the polarity scores:
> match.refined$polarity <- sentiment.score$sentiment
> head(match.refined)
ID cosine
TITLE
1 38081 1.0000000 PRECIOUS-Gold ticks lower,
US dollar holds near peak
2 38069 0.3779645 PRECIOUS-Bullion drops nearly 1 pct on dollar, palladium
holds near 2-1/2-yr high
3 231136 0.2672612 Dollar steady near 3-1/2 month
lows vs. yen, Aussie weaker
4 334088 0.2672612 Canadian dollar falls amid
lower than expected GDP data
5 276011 0.2519763 Gold holds near four-month low as
ECB move on rates awaited
6 394401 0.2390457 Dollar Tree Will Buy Competitor
Family Dollar For $8.5 Billion
PUBLISHER CATEGORY polarity
1 Reuters b 0.00000000
2 Reuters b 0.00000000
3 NASDAQ b -0.13333333
4 CTV News b -0.08333333
5 Business Standard b 0.07537784
6 The Inquisitr b 0.00000000
Before we move on, let's spend some time understanding the inner workings of our
dictionary-based sentiment method. The sentiment function utilizes a sentiment lexicon
(Jockers, 2017) from the lexicon package. It first preprocesses the given text, splitting it into
sentences and the sentences into words.
Each word is looked up in the lexicon; positive and negative words are tagged with +1 and
-1 respectively. Let's call the words that have received a score the polarized words. Not all
words receive a score; only those found in the lexicon do. We can pass a custom lexicon
through the polarity_dt parameter of the sentiment function. For each of the polarized
words, the n words before them and the n words after them are considered, and together
they are called the polarized context cluster. The parameter n can be set by the user.
The words in the polarized context cluster can be tagged as any of the following:
neutral
negator
amplifier
de-amplifier
adversative conjunctions
Jaccard's distance
While ranking the matched articles, we want to also include the category and publisher
columns.
We need the publisher, category, and the sentiment details of the document we are
searching for. Fortunately, the first row of our match.refined data frame stores all the
details related to 38081. We retrieve those values from there.
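A minimal sketch of pulling those target values from the first row (the target.* names are
assumptions, matching the target.publisher and target.polarity references below):
target.publisher <- match.refined[1, ]$PUBLISHER
target.category  <- match.refined[1, ]$CATEGORY
target.polarity  <- match.refined[1, ]$polarity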
For the rest of the articles, we need to find out if they match the publisher and category of
document 38081:
match.refined$is.publisher <- match.refined$PUBLISHER == target.publisher
match.refined$is.publisher <- as.numeric(match.refined$is.publisher)
Now we can go into match.refined and create a new column called is.publisher, a
Boolean column that says whether the article's publisher is the same as the publisher of the
article we are searching for.
Repeat the same for the category. We have created a new column called is.category to
store the category match.
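A one-line sketch mirroring the is.publisher step:
match.refined$is.category <- as.numeric(match.refined$CATEGORY == target.category)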
With the two new columns, we can calculate the Jaccard's distance between document
38081 and all the other documents in the match.refined data frame, as shown in the
following code block:
match.refined$jaccard <- (match.refined$is.publisher +
                          match.refined$is.category) / 2
Jaccard's distance/index
The Jaccard index measures the similarity between two sets as the ratio of the size of their
intersection to the size of their union: J(A, B) = |A intersect B| / |A union B|. Here we have
only two elements, one for the publisher and one for the category, so the size of our union
is 2. By adding the two Boolean variables we get the size of the intersection, which forms
the numerator.
Finally, we also calculate the absolute difference (Manhattan distance) between the polarity
value of our search article and those of the articles in the search results, and then do a
min/max normalization of the difference score, as follows:
match.refined$polaritydiff <- abs(target.polarity - match.refined$polarity)
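A minimal sketch of the min/max normalization mentioned above:
diff.range <- range(match.refined$polaritydiff)
match.refined$polaritydiff <- (match.refined$polaritydiff - diff.range[1]) /
                              (diff.range[2] - diff.range[1])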
We remove some of the unwanted fields from the match.refined data frame. Finally, we
have the ID, cosine distance, title, publisher, category, Jaccard score, and the polarity
difference.
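The cleanup itself appears in the complete listing at the end of the chapter; it is repeated
here for reference:
match.refined$is.publisher <- NULL
match.refined$is.category <- NULL
match.refined$polarity <- NULL
match.refined$sentiment <- NULL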
We need to perform our ranking based on the following metrics we have calculated:
Cosine similarity
Jaccard index
Polarity difference
A standard way of doing this would be to come up with another score, which is a linear
combination of the preceding three scores. We use that final score to sort the results.
However, we will leverage fuzzy logic programming to do the search result ranking.
Fuzzy logic
A detailed discussion of fuzzy logic is beyond the scope of this book. A short tutorial on
fuzzy logic is available at
http://cs.bilkent.edu.tr/~zeynep/files/short_fuzzy_logic_tutorial.pdf.
We will give a short introduction here. Let us use an example from the book Artificial
Intelligence: A Guide to Intelligent Systems by Michael Negnevitsky. Thanks to the article
from https://vpetro.io/.
Fuzzification
Fuzzification is the process where we convert our inputs and outputs to linguistic variables
using ranges and membership functions. In this example, we want to convert our inputs,
project funding and project staffing, and our output, risk, into linguistic variables:
Project funding: Inadequate, marginal, and adequate. These are the three
linguistic values we have for project funding. We will use a fuzzy cone
membership function.
Project staffing: Small and large, again with a fuzzy cone membership function.
Risk: Low, normal, and high.
Linguistic variable: A variable whose values are words in a natural language. Let
us take the example of project funding: the actual values of project funding in
this example are percentages. So if we say the project funding is 50%, then 50% of
the funding is still available. This 50% is now translated into the linguistic
values adequate, marginal, or inadequate. The fuzzy system works only with
these linguistic values and not with the original value.
Crisp values: The real values an input or output can take. For example, 50% is a
crisp value project funding can take.
Membership function: A function that defines how the crisp values are
mapped to a membership value (degree of membership). These functions can be
either linear or curved.
The membership function translates a crisp value into a degree of membership for each of
the linguistic values. We will not show the details of how the membership function
converts our crisp input values, but will give a simple example to follow.
Let us say we are trying to evaluate the risk for the following crisp values: project funding
at 35% and project staffing at 60%.
When we pass these values to the respective membership functions we get the following
results: for project funding, inadequate = 0.5, marginal = 0.2, and adequate = 0.0; similarly,
for project staffing, we get small = 0.1 and large = 0.7.
The crisp input value of 35% for project funding cannot be represented by the linguistic
value adequate, hence the membership value for adequate is 0.0. It does not belong
completely to the marginal linguistic value either, hence we have 0.2 as that value.
Similarly, 60% project staffing has a high degree of membership for the linguistic value
large, hence the value 0.7.
As you saw, fuzzification converts the crisp input values into fuzzy membership values for
the linguistic values.
Rules define how our input variables interact to produce the output. Here we have three
simple rules for our example:
If project funding is adequate or project staffing is small, then risk is low.
If project funding is marginal and project staffing is large, then risk is normal.
If project funding is inadequate, then risk is high.
The first rule has an OR clause; it is evaluated using the UNION operator, and the max
operator is used to evaluate it. There are other set operators, such as ASUM or BSUM,
which can also be used to evaluate the UNION (OR) operator.
Let us borrow the fuzzification results from the previous section. The fuzzification of
project funding = adequate for the crisp value 35% is 0.0. The fuzzification of project
staffing = small for the crisp value 60% is 0.1.
Using the max operator, the maximum of the two is taken as the result; hence this rule
evaluates to Project risk = low = max(0.0, 0.1) = 0.1.
The next rule, if project funding is marginal and project staffing is large then risk is normal,
has an AND, and hence we use an intersection operator. The intersection can be handled
using the min, prod, or bdif set operators.
The fuzzification of project funding = marginal for the crisp value 35% is 0.2. The
fuzzification of project staffing = large for the crisp value 60% is 0.7. Using the min
operator, the minimum is 0.2, so the rule evaluates to Project risk = normal = 0.2.
Finally, for the last rule, if project funding is inadequate then risk is high, we can directly
say Project risk = high = 0.5, as this is the fuzzification value for project funding =
inadequate for the crisp value of 35%.
Now we need to combine these results. This process of evaluating the rules and combining
them is called inference.
We have, for project risk, low = 0.1, normal = 0.2, and high = 0.5. We use these values to
scale the membership functions. The final decision plot is the summation of all these scaled
values.
Defuzzification
In fuzzification, we converted the crisp values of our input to fuzzy degree of membership
values for our linguistic variable. In defuzzification, we need to convert the degree of
membership values for our output variable to its crisp value.
Let us superimpose the membership function on top of the decision plot. We have assumed
a triangular membership function here. You can read more about triangular membership
functions and their importance in the paper Why triangular membership functions?
(http://www.sciencedirect.com/science/article/pii/0165011494900035).
The most popular defuzzification method is the centroid method. It calculates the center of
gravity for the area under the curve (the area under the green line in the decision plot). It is
calculated as follows:
((0 + 10 + 20) * 0.1 + (30 + 40 + 50 + 60) * 0.2 + (70 + 80 + 90 + 100) * 0.5) /
(0.1 * 3 + 0.2 * 4 + 0.5 * 4) = 209 / 3.1 = 67.4
Hence, we can conclude that the risk associated with the project is 67.4%.
In our example, our inputs are the cosine, Jaccard, and polarity scores, and our output is
the ranking. We first convert them to linguistic variables. For example, the cosine score is
partitioned into the linguistic values vlow, low, medium, and high.
Then we define a membership function, which is responsible for the fuzzy membership.
Let us say we define a cone as our membership function, with a radius of 0.2. There is no
real mandate for choosing a particular membership function. Refer to the membership plot
appearing a little later in this chapter for the fuzzy memberships. This function is
responsible for the fuzzification process: when we pass our cosine score into it, the score is
converted into degrees of membership for these linguistic values.
We follow the same process for the polarity score and Jaccard's distance. Finally, we also
define a last linguistic variable called ranking. This is our final ranking score, our output.
It is defined in the same way as cosine, polarity, and jaccard: it has a range and a
membership function.
Having done the fuzzification process for our similarity scores, let us proceed to define the
rules.
We can define more complex rules involving multiple linguistic variables; refer to the code
for the different rules we have defined for this scenario. After all the rules are evaluated,
we proceed to the inference process.
The last step is the defuzzification of the result into a crisp value. We use the centroid
method for defuzzification.
We will be using the sets R package for our fuzzy logic programming:
> library(sets, quietly = TRUE)
The first step is to set up our universe. We define the range of values and the granularity of
the values we will be dealing with in our universe. Our cosine, Jaccard, and polarity are all
normalized to have a value between zero and one. Hence, the range of our universe is set
between zero and one.
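A minimal sketch of that universe definition, assuming the sets package's sets_options
interface:
sets_options("universe", seq(from = 0, to = 1, by = 0.1))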
Next, we define the linguistic variables we will be dealing with:
variables <-
set(cosine =
fuzzy_partition(varnames =
c(vlow = 0.2, low = 0.4,
medium = 0.6, high = 0.8),
FUN = fuzzy_cone , radius = 0.2),
jaccard =
fuzzy_partition(varnames =
c(close = 1.0, halfway = 0.5,
far = 0.0),
FUN = fuzzy_cone , radius = 0.4),
polarity =
fuzzy_partition(varnames =
c(same = 0.0, similar = 0.3,close = 0.5,
away = 0.7),
FUN = fuzzy_cone , radius = 0.2),
ranking =
fuzzy_partition(varnames =
c(H = 1.0, MED = 0.7 , M = 0.5, L = 0.3),
FUN = fuzzy_cone , radius = 0.2
)
)
For each variable, we define the various linguistic values and the fuzzy membership
function. For example, for our linguistic variable cosine, the linguistic values include vlow,
low, medium, and high. We define the fuzzy_cone membership function.
For this application, we have used a cone membership function throughout. There are other
membership functions available.
Calling the R help for fuzzy_cone, the documentation will detail other available
membership functions:
> help("fuzzy_cone")
We define the jaccard and polarity linguistic variables in the same way as cosine.
The last linguistic variable, ranking, defines the final score we need, which we can use to
sort our results.
Based on the interaction between the linguistic variables cosine, jaccard, and polarity,
the ranking linguistic variables are assigned different linguistic values. These interactions
are defined as rules. Having defined the linguistic variables, the linguistic values, and the
membership function, we proceed to write down our fuzzy rules:
rules <-
set(
######### Low Ranking Rules ###################
fuzzy_rule(cosine %is% vlow,
ranking %is% L),
fuzzy_rule(cosine %is% low || jaccard %is% far
|| polarity %is% away,
ranking %is% L),
fuzzy_rule(cosine %is% low || jaccard %is% halfway
|| polarity %is% away,
ranking %is% L),
The rules define how the linguistic variables should be combined to make a decision. We
have three groups of rules. Let's look at the first rule:
fuzzy_rule(cosine %is% vlow, ranking %is% L),
If the cosine linguistic value is vlow, the ranking is given as L, the lowest ranking. Similarly,
each rule captures the interactions between the linguistic variables.
With the rules and linguistic variables defined, we can now put our complete fuzzy
system together:
> ranking.system <- fuzzy_system(variables, rules)
> print(ranking.system)
A fuzzy system consisting of 4 variables and 14 rules.
variables:
rules:
cosine %is% low && jaccard %is% close && polarity %is% similar => ranking
%is% M
cosine %is% medium && jaccard %is% close && polarity %is% same => ranking
%is% MED
cosine %is% medium && jaccard %is% close && polarity %is% similar =>
ranking %is% MED
cosine %is% medium && jaccard %is% halfway && polarity %is% same => ranking
%is% MED
cosine %is% medium && jaccard %is% halfway && polarity %is% similar =>
ranking %is% MED
cosine %is% low || jaccard %is% far || polarity %is% away => ranking %is% L
cosine %is% low || jaccard %is% close || polarity %is% same => ranking %is%
M
cosine %is% low || jaccard %is% halfway || polarity %is% away => ranking
%is% L
cosine %is% low || jaccard %is% halfway || polarity %is% same => ranking
%is% L
cosine %is% low || jaccard %is% halfway || polarity %is% close => ranking
%is% L
cosine %is% low || jaccard %is% halfway || polarity %is% similar => ranking
%is% L
cosine %is% medium || jaccard %is% far || polarity %is% away => ranking
%is% L
cosine %is% high => ranking %is% H
cosine %is% vlow => ranking %is% L
> plot(ranking.system)
The final plot reveals the fuzziness in the boundary for different linguistic variables.
Compare this with a hard-coded if-else logic system.
We can now proceed to use this system to do the ranking. Let's do the ranking on a single
example:
fi <- fuzzy_inference(ranking.system, list(cosine = 0.5, jaccard = 0,
                                           polarity = 0))
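The crisp score quoted below is assumed to come from defuzzifying this result with the
centroid method, as in the following sketch:
gset_defuzzify(fi, "centroid")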
For given values of cosine, polarity, and Jaccard, we get a ranking score of 0.4. Now we can
use this score to rank the results.
The get.ranks function is applied in each row of match.refined to get the fuzzy
ranking. Finally, we sort the results using this ranking.
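The get.ranks helper itself is not reproduced on these pages; the following is a minimal
sketch of what such a function could look like under the fuzzy system defined above (the
exact implementation in the book's full listing may differ):
get.ranks <- function(row.values){
  fi <- fuzzy_inference(ranking.system,
                        list(cosine = as.numeric(row.values["cosine"]),
                             jaccard = as.numeric(row.values["jaccard"]),
                             polarity = as.numeric(row.values["polaritydiff"])))
  return(gset_defuzzify(fi, "centroid"))
}
match.refined$ranking <- apply(match.refined[, c("cosine", "jaccard", "polaritydiff")],
                               1, get.ranks)
match.refined <- match.refined[order(-match.refined$ranking), ]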
This brings us to the end of our design and implementation of a simple fuzzy-induced
content-based recommendation system.
Complete R Code
The complete R code is shown as follows:
library(data.table)
wine.data <-
  fread('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data')
head(wine.data)
table(wine.data$V1)
cor.matrix[1:5,1:5]
rbind(wine.data[3,]
,wine.data[52,]
,wine.data[51,]
,wine.data[85,]
,wine.data[15,]
)
library(tidyverse)
library(tidytext)
library(tm)
library(slam)
head(data)
data %>% group_by(PUBLISHER) %>% summarise()
data %>% group_by(CATEGORY) %>% summarise()
getTransformations()
stopwords("english")
sim.score[1:10,1:10]
head(match.refined)
# Calculate Jaccard's distance
match.refined$jaccard <- (match.refined$is.publisher +
match.refined$is.category)/2
match.refined$polaritydiff <- abs(target.polarity - match.refined$polarity)
head(match.refined)
## clean up
match.refined$is.publisher = NULL
match.refined$is.category = NULL
match.refined$polarity = NULL
match.refined$sentiment = NULL
head(match.refined)
variables <-
set(cosine =
fuzzy_partition(varnames =
c(vlow = 0.2, low = 0.4,
medium = 0.6, high = 0.8),
FUN = fuzzy_cone , radius = 0.2),
jaccard =
fuzzy_partition(varnames =
c(close = 1.0, halfway = 0.5,
far = 0.0),
FUN = fuzzy_cone , radius = 0.4),
polarity =
fuzzy_partition(varnames =
c(same = 0.0, similar = 0.3,close = 0.5,
away = 0.7),
FUN = fuzzy_cone , radius = 0.2),
ranking =
fuzzy_partition(varnames =
c(H = 1.0, MED = 0.7 , M = 0.5, L = 0.3),
FUN = fuzzy_cone , radius = 0.2
)
)
rules <-
set(
######### Low Ranking Rules ###################
fuzzy_rule(cosine %is% vlow,
ranking %is% L),
fuzzy_rule(cosine %is% low || jaccard %is% far
|| polarity %is% away,
ranking %is% L),
fuzzy_rule(cosine %is% low || jaccard %is% halfway
|| polarity %is% away,
ranking %is% L),
plot(ranking.system)
Summary
We started the chapter by introducing content-based filtering. We discussed how content-
based filtering methods can help with cold-start problems in recommendation systems. We
then explained the news aggregator use case. We explored the data provided by the
customer--various news articles from different publishers belonging to different categories.
Based on the data, we came up with a design for our content-based recommendation
system.
We implemented a similarity dictionary; given a news article, this dictionary would be able
to provide the top N matching articles. The similarity was calculated based on the words
present in the article. We leveraged the vector space model for text and ultimately used the
cosine distance to find the similarities between articles.
We implemented a simple search based on the similarity dictionary to get a list of matching
news articles. We then introduced additional features for the matching documents: the
sentiment and polarity scores.
In the next chapter, we will go into the details of the various approaches to the
collaborative filtering algorithm. Collaborative filtering is among the most powerful ways
to build recommender systems.
3
Collaborative Filtering
Given a database of user ratings for products, where a set of users have rated a set of
products, collaborative filtering algorithms can give ratings for products yet to be rated by a
particular user. This leverages the neighborhood information of the user to provide such
recommendations. The input to collaborative filtering is a matrix, where the rows are users
and the columns are items. Cell values are the ratings provided by the user for a product.
Ratings of products are ubiquitous in today's internet world. IMDB, Yelp, Amazon, and
similar systems today have a rating system deployed to capture user preferences.
Preferences are typically captured by a rating system, where the ratings are defined as stars
or a points system.
Collaborative filtering systems are typically built in one of two ways:
An online system, where the whole user-product preference matrix is loaded into
memory to make recommendations for a user and his yet-to-be-rated product
combinations
A model-based system, where an approach similar to clustering is used to group
users with similar preferences, and those cluster centroids are later used to make
recommendations
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the R recommenderlab package that we
will be using. During our code description, we will be using some of the output printed in
the console. We have included what will be printed in the console immediately after the
statement that prints the information to the console, so as to not disturb the flow of the
code.
Collaborative filtering
Given a user rating matrix, where several users have rated several products, the goal of
collaborative filtering is to predict the missing ratings, that is, the ratings for the products
that a particular user has not yet rated.
The underlying premise of the collaborative filtering algorithm is that if two users agree on
ratings for a large set of items, they may tend to agree for other items too. Let us use a small
R code snippet to explain this concept. Assume we have seven products (A, B, C, D, E, F, G)
and two users (user.a and user.b). We also have the ratings provided by both of the
users for some of the products. The ratings range from 1 to 5, with 1 indicating a poor
rating, 5 indicating a good rating, and 0 indicating no rating.
user.a user.b
A 3 3
B 0 5
C 2 3
D 5 5
E 5 4
F 0 2
Both users agree on the ratings for products A, C, D, and E. User user.a has not seen, and therefore not rated, products B and F. However, we see that user.b has rated those products. Since both users agree on most of the products, we can use user.b's ratings for products B and F to fill in the missing ratings for user.a.
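As a minimal sketch (it mirrors the snippet in the chapter-end code listing; the fill-in step at the end is purely illustrative):

products <- c('A','B','C','D','E','F','G')
user.a <- c(3, 0, 2, 5, 5, 0, 1)
user.b <- c(3, 5, 3, 5, 4, 2, 1)
# agreement on the commonly rated products
rated.by.both <- user.a > 0 & user.b > 0
cor(user.a[rated.by.both], user.b[rated.by.both])
# borrow user.b's ratings where user.a has none
user.a.filled <- ifelse(user.a == 0, user.b, user.a)
names(user.a.filled) <- products
user.a.filled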
This is the fundamental working premise of collaborative filtering. The example is based on
a user-based collaborative filtering approach. There are other approaches to collaborative
filtering which are shown in the following diagram, but we hope this example serves to
introduce the concept.
Having explained the hypothesis behind the algorithm, let us look at the approaches to
collaborative filtering:
Memory-based approach
The memory-based approach to collaborative filtering loads the whole rating matrix into
memory to provide recommendations, hence the name memory-based model. User-based
filtering is the most prominent memory-based collaborative filtering model. The R snippet
explained in the preceding section is the underlying principle by which memory-based
methods work. As the user and product base grows, scalability is a big issue with memory-
based models.
In the R snippet example, we used the ratings of user.b to fill in the missing ratings for user.a. In the real world, with thousands of users, the user-based filtering algorithm proceeds as follows.
Let us say we have a set of users {u1, u2, ..., un} and a set of products {p1, p2, ..., pm}, and we are trying to predict the ratings for a single user, ua.
Using a similarity measure such as the Pearson coefficient or cosine distance, we try to find the K nearest neighbors of ua. For more on similarity measures, refer to http://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html.
When two vectors are centered (zero mean), their cosine similarity is the same as the Pearson coefficient.
Let us say ua does not have ratings for p5, p7, and p9. We take the average of the ratings for these products from the K neighbors of ua and use those averages as the ratings for ua.
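To make this concrete, here is a small hand-rolled sketch of neighbor-based prediction (illustrative only; later in the chapter we rely on recommenderlab's UBCF implementation instead, and treating a missing rating as 0 in the similarity step is a simplification):

cosine.sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

predict.user.based <- function(ratings, target.user, item, K = 2) {
  others <- setdiff(rownames(ratings), target.user)
  sims <- sapply(others, function(u) cosine.sim(ratings[target.user, ], ratings[u, ]))
  neighbours <- names(sort(sims, decreasing = TRUE))[1:K]
  mean(ratings[neighbours, item])   # average of the K neighbours' ratings
}

ratings <- rbind(ua = c(3, 0, 2, 5), ub = c(3, 5, 3, 5), uc = c(1, 4, 2, 4))
colnames(ratings) <- c("A", "B", "C", "D")
predict.user.based(ratings, "ua", "B")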
Model-based approach
The model-based approach addresses the scalability problem seen in memory-based approaches. Compared to the user-based approach, where the recommendations come from leveraging the preferences of a user's neighbors, the model-based approach leverages product similarity to make recommendations. The premise is that users will prefer products similar to the ones they have already rated.
The first step in this approach is to calculate the product similarity matrix. Let us say there is a set of products {p1, p2, ..., pm}. An m x m matrix is constructed to begin with, and once again the Pearson coefficient or cosine similarity is used as the similarity measure. For efficiency, instead of building the whole m x m matrix, a smaller m x k matrix is built, where k << m, keeping only the k most similar items for each product.
We have three users (user.a, user.b, and user.c) and three products (A, B, and C). We also have the ratings the users have provided for these products, on a scale from 1 to 5. You should see ratings.matrix, the final rating matrix summarizing the user-product ratings.
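A minimal sketch of such a matrix and its similarity computation; the specific rating values are assumptions, so the resulting numbers will not match the ones discussed next:

ratings.matrix <- matrix(c(4, 3, 5,
                           5, 2, 4,
                           1, 5, 2),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("user.a", "user.b", "user.c"),
                                         c("A", "B", "C")))
ratings.matrix

# item-to-item similarity: correlate the columns (products)
sim.matrix <- cor(ratings.matrix)
round(sim.matrix, 2)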
We have used the cor function from base R to calculate the similarity matrix. Let us say we want to find the rating of product B for user.a.
The most similar product to B is A, as it has a higher similarity score than C. In a real-world example, we would have many more similar products. Now the score of B for user.a is calculated as follows:
The predicted rating for product B for user.a is calculated using similarity measures
between (B, A) and (B, C) weighted by the respective ratings given by user.a for A and C
and scaling this weighted sum with the sum of similarity measures.
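Written out as a small calculation (the similarity and rating values below are placeholders):

sim.B.A <- 0.8; sim.B.C <- 0.3        # similarity of B to A and to C
r.a.A   <- 4;   r.a.C   <- 5          # user.a's ratings for A and C
pred.B  <- (sim.B.A * r.a.A + sim.B.C * r.a.C) / (sim.B.A + sim.B.C)
pred.B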
The similarity matrix of products is pre-computed, and therefore this approach does not have the overhead of loading the whole user ratings matrix into memory. Amazon.com successfully uses item-based recommendations.
Let us say our original matrix is m x n, where we have m users and n products. Applying SVD to this matrix decomposes it into an m x f user-factor matrix and an n x f product-factor matrix (together with a diagonal matrix of singular values that can be folded into either factor).
Applying SVD to matrices with missing values is not possible, and our ratings matrix will have a lot of missing values. Simon Funk proposed a method, now called Funk SVD, to perform SVD on large matrices using numerical optimization and gradient descent. Funk SVD is implemented in the recommenderlab package. The dimension f in the m x f and n x f matrices represents the latent factors, hence the name of this method: the latent factor method. These latent factors represent hidden features; for example, in a movie database, where our items correspond to movies, one or two latent factors may point to the box office success of a movie.
Each user now has a vector from the m x f matrix. Similarly, each product has a vector from the n x f matrix. For a given user and item, multiplying their respective vectors gives the rating for that item specific to that user. Can you guess why? Here is where the features/factors come into play. The features/factors are shared by both matrices; the ratings in the original ratings matrix are decomposed into this set of features. Thus, when we take the dot product of the user vector (1 x f) and the product vector (1 x f), we get a scalar value, which is the rating for that product by that user.
Using this, we can now predict ratings for products a user has not yet rated.
We use the svd function from base R to decompose ratings.mat. ratings.mat is the same ratings matrix we used to explain item-based similarity. user.a.vector is the vector representing user.a, and product.B.vector is the vector representing product B. ratings.user.a.prod.B is the dot product of these two vectors, representing the rating of product B by user.a.
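A minimal sketch of this decomposition; how the user and product vectors are built from the svd() output is an assumed reconstruction:

ratings.mat <- matrix(c(4, 3, 5,
                        5, 2, 4,
                        1, 5, 2),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("user.a", "user.b", "user.c"),
                                      c("A", "B", "C")))
decomp <- svd(ratings.mat)

# fold the singular values into both factor matrices
user.factors    <- decomp$u %*% diag(sqrt(decomp$d))   # one row (vector) per user
product.factors <- decomp$v %*% diag(sqrt(decomp$d))   # one row (vector) per product

user.a.vector    <- user.factors[1, ]
product.B.vector <- product.factors[2, ]

# multiplying the two vectors reconstructs user.a's rating for product B
ratings.user.a.prod.B <- sum(user.a.vector * product.B.vector)
ratings.user.a.prod.B
ratings.mat["user.a", "B"]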
Now that we have explained all the three different approaches to build a collaborative
filtering method, let us proceed to look at the recommenderlab package.
Recommenderlab package
It will be very useful for the subsequent sections to get an overview of
the recommenderlab package. Let us quickly look at the S4 objects inside the package and
see how we can use them to build collaborative filtering projects.
ratingMatrix is an abstract class used to store and manipulate the ratings matrix. The term abstract comes from object-oriented programming: ratingMatrix defines the interface for a user ratings matrix, but does not implement it. There are two concrete implementations of this class, realRatingMatrix for real-valued ratings and binaryRatingMatrix for binary ratings. The Recommender class is used to store the recommendation models. It takes a ratings matrix object and other parameters as input and builds the required recommender model.
The predict function can produce recommendations for unseen/unknown data (where we
don't know the recommendation) using the Recommender model.
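For example, a random binary matrix can be coerced into recommenderlab's classes as follows (a minimal sketch; the matrix size is an assumption):

library(recommenderlab)
set.seed(100)
bin.mat <- matrix(sample(c(0, 1), 20, replace = TRUE), nrow = 4, ncol = 5)
rating.mat <- as(bin.mat, "binaryRatingMatrix")
rating.mat

# a real-valued matrix can be coerced in the same way
real.mat <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4, ncol = 5)
as(real.mat, "realRatingMatrix")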
In the preceding code, we create a random binary matrix and coerce it to a binaryRatingMatrix object.
Similarly, we can also create a realRatingMatrix.
Having built our ratingsMatrix, let us go ahead and create a simple Recommender model
based on this matrix.
.. .. .. ..@ n : int 5
..@ predict :function (model, newdata, n = 10, data = NULL, type =
c("topNList"), ...)
>
We pass our ratings matrix to the Recommender class and select an approach called
popular. This creates our recommendation model.
Popular approach
This is a very simple strategy where the recommendations are based on the item popularity.
Among the products not rated by a user, the most popular ones among all the users are
suggested as recommendations.
There are several other implementations of the recommendation model available in this
package.
$AR_binaryRatingMatrix
Recommender method: AR for binaryRatingMatrix
Description: Recommender based on association rules.
Reference: NA
Parameters:
support confidence maxlen sort_measure sort_decreasing apriori_control verbose
1 0.1 0.8 3 "confidence" TRUE list() FALSE
$IBCF_binaryRatingMatrix
Recommender method: IBCF for binaryRatingMatrix
Description: Recommender based on item-based collaborative filtering
(binary rating data).
Reference: NA
Parameters:
k method normalize_sim_matrix alpha
1 30 "Jaccard" FALSE 0.5
$POPULAR_binaryRatingMatrix
Recommender method: POPULAR for binaryRatingMatrix
Description: Recommender based on item popularity.
Reference: NA
Parameters: None
$RANDOM_binaryRatingMatrix
Recommender method: RANDOM for binaryRatingMatrix
Description: Produce random recommendations (binary ratings).
Reference: NA
Parameters: None
$UBCF_binaryRatingMatrix
Recommender method: UBCF for binaryRatingMatrix
Description: Recommender based on user-based collaborative filtering.
Reference: NA
Parameters:
method nn weighted sample
1 "jaccard" 25 TRUE FALSE
Having created the Recommender model, let us see how to leverage it to produce the top N
recommendations.
$`2`
[1] 1
$`3`
[1] 2 5
$`4`
[1] 4 5
> recomms@ratings
NULL
We pass our Recommender model to the predict function. We are testing it on our input data, and therefore we set the newdata parameter to our training data. Finally, the parameter n decides how many recommendations we need. The recommendations are printed in the next line by calling recomms@items. Since our input is a binary matrix, we don't have any ratings.
Similarly, we can create a realRatingMatrix either from an R matrix or from a data frame, create a Recommender model, and proceed with the recommendations.
This concludes our introduction to the recommenderlab package. Let us use this knowledge to solve the problem at hand for this chapter. For more information about recommenderlab, visit https://cran.r-project.org/web/packages/recommenderlab/index.html.
> str(Jester5k)
Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:362106] 0 1 2 3 4 5 6 7 8 9 ...
.. .. ..@ p : int [1:101] 0 3314 6962 10300 13442 18440 22513 27512
32512 35685 ...
.. .. ..@ Dim : int [1:2] 5000 100
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:5000] "u2841" "u15547" "u15221" "u15573" ...
.. .. .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
.. .. ..@ x : num [1:362106] 7.91 -3.2 -1.7 -7.38 0.1 0.83 2.91
-2.77 -3.35 -1.99 ...
.. .. ..@ factors : list()
..@ normalize: NULL
> head(Jester5k@data[1:5,1:5])
5 x 5 sparse Matrix of class "dgCMatrix"
j1 j2 j3 j4 j5
u2841 7.91 9.17 5.34 8.16 -8.74
u15547 -3.20 -3.50 -9.56 -8.74 -6.36
u15221 -1.70 1.21 1.55 2.77 5.58
u15573 -7.38 -8.93 -3.88 -7.23 -4.90
u21505 0.10 4.17 4.90 1.55 5.53
The data is stored as a realRatingMatrix. We can examine the slots of this object by calling the str() function. Looking at the data slot of the realRatingMatrix, we see that it's a matrix with 5,000 rows and 100 columns. The rows correspond to users and the columns to jokes. If all users had rated all jokes, we would have a total of 500,000 ratings; however, we see that we have 362,106 ratings.
As the name of the dataset suggests, Jester5k has 5,000 users and their ratings for 100 jokes. Further, let's look at the first five rows and columns of the realRatingMatrix to get a quick look at the ratings.
Let us look at the ratings provided by two random users, 1 and 100:
> Jester5k@data[1,]
j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12
j13 j14 j15 j16 j17 j18
7.91 9.17 5.34 8.16 -8.74 7.14 8.88 -8.25 5.87 6.21 7.72 6.12
-0.73 7.77 -5.83 -8.88 8.98 -9.32
j19 j20 j21 j22 j23 j24 j25 j26 j27 j28 j29 j30
j31 j32 j33 j34 j35 j36
-9.08 -9.13 7.77 8.59 5.29 8.25 6.02 5.24 7.82 7.96 -8.88 8.25
3.64 -0.73 8.25 5.34 -7.77 -9.76
j37 j38 j39 j40 j41 j42 j43 j44 j45 j46 j47 j48
j49 j50 j51 j52 j53 j54
7.04 5.78 8.06 7.23 8.45 9.08 6.75 5.87 8.45 -9.42 5.15 8.74
6.41 8.64 8.45 9.13 -8.79 6.17
j55 j56 j57 j58 j59 j60 j61 j62 j63 j64 j65 j66
j67 j68 j69 j70 j71 j72
8.25 6.89 5.73 5.73 8.20 6.46 8.64 3.59 7.28 8.25 4.81 -8.20
5.73 7.04 4.56 8.79 0.00 0.00
j73 j74 j75 j76 j77 j78 j79 j80 j81 j82 j83 j84
j85 j86 j87 j88 j89 j90
0.00 0.00 0.00 0.00 -9.71 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> Jester5k@data[100,]
j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12
j13 j14 j15 j16 j17 j18
-2.48 3.93 2.72 -2.67 1.75 3.35 0.73 -0.53 -0.58 3.88 3.16 1.17
0.53 1.65 1.26 -4.08 -0.49 -3.79
j19 j20 j21 j22 j23 j24 j25 j26 j27 j28 j29 j30
j31 j32 j33 j34 j35 j36
-3.06 -2.33 3.59 0.58 0.39 0.53 2.38 -0.05 2.43 -0.34 3.35 2.04
2.33 3.54 -0.19 -0.24 2.62 3.83
j37 j38 j39 j40 j41 j42 j43 j44 j45 j46 j47 j48
j49 j50 j51 j52 j53 j54
-2.52 5.19 1.75 0.00 0.39 1.75 -3.64 -2.28 2.33 3.16 -2.48 0.19
2.82 4.22 -0.19 3.30 -0.53 3.45
j55 j56 j57 j58 j59 j60 j61 j62 j63 j64 j65 j66
j67 j68 j69 j70 j71 j72
-0.53 0.97 -2.91 -8.25 -0.29 2.52 4.66 3.50 -0.24 3.64 -0.05 1.21
-3.25 1.17 -2.57 -2.18 -5.44 2.67
j73 j74 j75 j76 j77 j78 j79 j80 j81 j82 j83 j84
j85 j86 j87 j88 j89 j90
2.57 -4.03 2.96 3.40 1.12 1.36 -3.01 2.96 2.04 -3.25 1.94 -3.40
-3.50 -3.45 -3.06 2.04 3.20 3.06
j91 j92 j93 j94 j95 j96 j97 j98 j99 j100
2.86 -5.15 3.01 0.83 -6.21 -6.60 -6.31 3.69 -4.22 0.97
>
As we can see in the data, users don't have ratings for all the jokes. The two users we have just shown have zero as the rating for some of the jokes; a zero value indicates that the joke was not rated. To get the number of jokes a user has rated, we can run:
length(Jester5k@data[100,][Jester5k@data[100,]>0]) # answer = 58
We now want to see, per user, the count of jokes they have not rated. We can achieve this by counting the zero entries in each row of our ratings matrix. After doing so, we create a dataframe, zero.ratings.df, with two columns: the first column is the user and the second column is the number of zero entries they have in the ratings matrix, that is, the number of jokes they have not rated. Further, we can order zero.ratings.df in descending order by this count. We can see that user u3228 has not rated 66 jokes.
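The construction of zero.ratings.df is not shown here; a minimal sketch of the step just described (the exact original code may differ):

ratings.dense <- as(Jester5k@data, "matrix")
zero.counts <- rowSums(ratings.dense == 0)                 # unrated jokes per user
zero.ratings.df <- data.frame(user = rownames(ratings.dense), count = zero.counts)
head(zero.ratings.df[order(-zero.ratings.df$count), ])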
Let us use this data to make a histogram to see the underlying distribution:
> hist(zero.ratings.df$count, main ="Distribution of zero rated jokes")
The histogram shows the count of zeros (that is, unrated jokes), so a low value on the x-axis means that a user has rated more jokes. The tall 0-5 bin is testimony to this: a large group of users have left only 0 to 5 of the 100 jokes unrated. However, the distribution appears to have three modes.
The three modes are evident from the density plot. We have three groups of users in our database: those who have left only a small number of jokes unrated, another group with around 25 jokes unrated, and a final group with around 65 jokes unrated.
We can fact-check these user clusters by running the k-means algorithm on our data. We set the parameter k to the number of clusters and collect our results in a dataframe named model.df. The cluster centers reflect the number of jokes not rated. More information about k-means in R can be found at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html.
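A minimal sketch of this check (k = 3 matches the three modes seen in the density plot; the dataframe name model.df follows the text):

set.seed(100)
km <- kmeans(zero.ratings.df$count, centers = 3)
model.df <- data.frame(user = zero.ratings.df$user,
                       unrated = zero.ratings.df$count,
                       cluster = km$cluster)
km$centers      # the cluster centers reflect the number of unrated jokes
head(model.df)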
Now that we've looked at the user distribution, let us proceed to look at the joke's ratings:
> Jester5k@data[,1]
u2841 u15547 u15221 u15573 u21505 u15994 u238 u5809 u16636 u12843
u17322 u13610 u7061 u23059 u7299
7.91 -3.20 -1.70 -7.38 0.10 0.83 2.91 -2.77 -3.35 -1.99
-0.68 9.17 -9.71 -3.16 5.58
u20906 u7147 u6662 u4662 u5798 u7904 u7556 u3970 u999 u5462
u20231 u13120 u22827 u20747 u1143
9.08 0.00 -6.70 0.00 1.02 0.00 -3.01 5.87 -7.33 0.00
7.48 0.00 -9.71 0.00 0.00
u11381 u6617 u7602 u12658 u4519 u18953 u5021 u6457 u24750 u20139
u13802 u16123 u7778 u15509 u8225
0.00 6.55 0.00 0.00 0.78 -0.10 0.00 0.00 0.00 -6.65
2.28 1.02 0.00 -8.35 5.53
u12519 u16885 u12094 u6083 u19086 u1840 u7722 u17883 u12579 u3815
u12563 u12313 u18725 u4354 u21146
....................
[ reached getOption("max.print") -- omitted 4000 entries ]
>
> par(mfrow=c(2,2))
> joke.density <- density(Jester5k@data[,1][Jester5k@data[,1]!=0])
> plot(joke.density)
We look at the first joke, Jester5k@data[,1], for its scoring values; the output shown in
the preceding code is truncated.
Further, we plot the density plot for four randomly selected jokes (1, 25, 75, and 100) and
look at the distribution of scores.
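The remaining plotting calls are not shown here; a minimal sketch that follows the same pattern used for joke 1 would be:

par(mfrow = c(2, 2))
for (j in c(1, 25, 75, 100)) {
  joke.ratings <- Jester5k@data[, j]
  plot(density(joke.ratings[joke.ratings != 0]),
       main = paste("Ratings density, joke", j))
}
par(mfrow = c(1, 1))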
For all four jokes, we see that most of the ratings lie above zero. The recommenderlab package provides a function, getRatings, which works on the S4 object to retrieve the ratings.
Let us now move on to see if we can find the most popular joke.
joke pratings
j80 j80 1072
j90 j90 1057
j73 j73 1041
j77 j77 1012
j86 j86 994
j79 j79 934
j75 j75 895
j71 j71 796
j58 j58 695
j74 j74 689
>
We begin by binarizing our ratings matrix. In the new matrix, ratings.binary, ratings of 0 or more are considered positive and all others are considered negative. We create a dataframe, ratings.sum.df, with two columns: the joke name and the sum of the binarized ratings received by the joke. Since we have binarized the matrix, this sum serves as a measure of the popularity of a joke. Displaying the dataframe in descending order of the sum, we see the most popular and the least popular jokes.
Finally, we are going to sample 1,500 users from the dataset and use that as our data for the rest of the chapter. Sampling is often used during exploration: by taking a smaller subset of the data, you can explore it and produce initial models more quickly. Then, when the time comes, you can apply the best approach to the entire dataset.
The following is the image for the distribution of ratings for 1500 users:
Hopefully, the data exploration we performed in this section has given you a good
overview of the underlying dataset. Let us proceed now to build our joke recommendation
system.
The steps in designing our recommendation project are shown in the following figure:
Ratings matrix
The first step is to obtain the ratings matrix. The recommenderlab package expects the user rating matrix to be stored as either a binaryRatingMatrix or a realRatingMatrix. The realRatingMatrix S4 class has a slot called data where the actual ratings matrix is stored in a compressed format. The getRatingMatrix function is an easy way to extract this matrix from the S4 object.
..@ Dimnames:List of 2
.. ..$ : chr [1:1500] "u24377" "u11509" "u6569" "u19311" ...
.. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
..@ x : num [1:108531] 7.72 0.68 8.35 -1.94 -4.61 2.18 7.96 2.23
8.06 -0.53 ...
..@ factors : list()
>
First, we take a sample of 1,500 users from our dataset. Using getRatingMatrix, we extract the ratings matrix from this realRatingMatrix. We see that it's stored as a dgCMatrix, a compressed sparse matrix format.
As you can see, using the data slot of our dataset, we can extract the ratings matrix. Now that we have obtained the ratings matrix, we proceed to the next step, normalization.
When you have your own dataset, you can create a matrix or a dataframe out of it and coerce it to a realRatingMatrix or a binaryRatingMatrix, as shown in the previous section where we introduced the recommenderlab infrastructure.
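For example, assuming recommenderlab's coercion from a user/item/rating data frame (the column names here are illustrative), your own data could be converted like this:

my.ratings <- data.frame(user = c("u1", "u1", "u2"),
                         item = c("i1", "i2", "i1"),
                         rating = c(5, 3, 4))
my.mat <- as(my.ratings, "realRatingMatrix")
my.mat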
Normalization
Our next step is to normalize the ratings. Currently, our ratings range from -10 to 10. For some of the underlying algorithms to work efficiently, it is generally good practice to normalize the data. One reason for normalizing is to account for the different biases people have when rating items. Users do not rate items consistently: one user may mainly rate only items they like (that is, mostly 4/5 or 5/5), whereas another user may rate across the whole range. Normalizing the data accounts for these biases.
The recommenderlab package supports two normalization approaches:
1. The first one is centering; it removes the row bias by subtracting the row mean from all the values in that row.
2. The second one is z-score normalization; a z-score is obtained by subtracting the mean from the individual scores and dividing by the standard deviation.
Centering, as the name indicates, makes the mean zero. It does not change the scale of the variable. If all the variables in the dataset are measured on the same scale, we can adopt centering as our normalization technique.
In contrast, z-score normalization changes the scale in addition to centering the data: a unit change in the normalized data now corresponds to one standard deviation in the original data. When the variables are measured on different scales, it makes sense to use z-score normalization.
The normalize function is used to normalize the data. The method parameter dictates whether we use the centering or the z-score approach. We then plot the resulting ratings distributions.
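A minimal sketch of this step, using the object names from the chapter-end code listing:

data.norm <- normalize(data, method = "center")
data.norm.z <- normalize(data, method = "Z-score")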
The plots of the raw, normalized, and z-score normalized ratings are shown in the
following figure:
From the plots, we can see that we have brought the data close to a normal distribution
from its original distribution.
Let's proceed to the next step, where we want to arrange our data in such a way that we can
evaluate the performance of our Recommender model.
The recommenderlab package provides the necessary infrastructure to achieve this. The
evaluationScheme class is provided to help us create a train/test strategy. It takes a
ratingsMatrix as an input and provides several schemes including simple split, bootstrap
sampling, and k-fold cross-validation to create an evaluation scheme.
Using the evaluationScheme method, we allocate 90% of our data to training. The given parameter works as follows: for each test user, 10 rated jokes are handed to the recommender as known ratings, and the remaining ratings are withheld for evaluation. The model predicts ratings for the withheld jokes, and these predictions can be compared to the actual values. That way we can evaluate the performance of our recommender system. Finally, the parameter goodRating defines the threshold for a good rating: a rating at or above the threshold is treated as a positive rating. This is critical for evaluating our recommender as a classifier. If we want to find the accuracy of our recommender system, we need to count the cases where our predictions matched the actuals, and a predicted and an actual rating are considered a match when both are at or above the threshold. We chose the threshold here by looking at the density plot of the z-score normalized data.
The method parameter is very important, as it decides the evaluation scheme. In our example, we have used split, which randomly assigns users to the train or test set based on the given proportion.
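The scheme described above mirrors the calls in the chapter-end code listing:

plan <- evaluationScheme(data, method = "split", train = 0.9, given = 10,
                         goodRating = 5)
train <- getData(plan, "train")
test.known <- getData(plan, "known")
test <- getData(plan, "unknown")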
Another important parameter we did not include is k. By default, the value of k is set to 1
for the split method. It decides the number of times to run the evaluation scheme. If we
had selected k-fold cross-validation as our method, k defaults to 10. A total of 10 different
models will be built for 10 splits of the data and the performance will be an average of all
the 10 models.
As you can see in a three-fold validation, we have three splits of our train and test data. The
model is evaluated against all the three splits.
Now that we have built our scheme, we can use the scheme to extract our test and train
dataset.
The getData function is used to extract the train and test datasets. The type parameter selects which dataset is returned: train returns the training dataset, known returns the known ratings from the test dataset, and unknown returns the ratings withheld for evaluation.
This is a standard technique followed in any machine learning algorithm development. The
train dataset is used to train the model and the test dataset is used to test the performance
of the model. If the model performs poorly, we go back to using the train dataset and tune
the model. Tuning may involve either changing the approach (that is, instead of using user-
based models, let us say we move to latent-based models and evaluate the performance) or
we start changing some of the parameters of our existing approach. Say in the case of user-
based filtering, we change the number of neighbors.
Our test dataset is further divided into test.known and test. The dataset test.known holds the known ratings for 10 jokes per test user. Using those ratings, our recommender system predicts recommendations, which can then be compared with the actual values from the test dataset.
Out of a total of 1,500 users, 90% of the data (that is, 1,350 users) is allocated to training. The remaining 150 users are reserved for testing.
Train model
We have our data split into the following:
train: The user matrix we will use to build our recommendation model.
test.known: We will feed this for our predict method along with our model. The
output can now be compared to our test dataset.
test: The test dataset is used for evaluating our model
Using these, let us go ahead and build our recommender system. For the first model, we are
going to build a random model.
$labels
[1] "j1" "j2" "j3" "j4" "j5" "j6" "j7" "j8" "j9" "j10"
"j11" "j12" "j13" "j14"
[15] "j15" "j16" "j17" "j18" "j19" "j20" "j21" "j22" "j23" "j24"
"j25" "j26" "j27" "j28"
[29] "j29" "j30" "j31" "j32" "j33" "j34" "j35" "j36" "j37" "j38"
"j39" "j40" "j41" "j42"
[43] "j43" "j44" "j45" "j46" "j47" "j48" "j49" "j50" "j51" "j52"
"j53" "j54" "j55" "j56"
[57] "j57" "j58" "j59" "j60" "j61" "j62" "j63" "j64" "j65" "j66"
"j67" "j68" "j69" "j70"
[71] "j71" "j72" "j73" "j74" "j75" "j76" "j77" "j78" "j79" "j80"
"j81" "j82" "j83" "j84"
[85] "j85" "j86" "j87" "j88" "j89" "j90" "j91" "j92" "j93" "j94"
"j95" "j96" "j97" "j98"
[99] "j99" "j100"
>
> random.predict@ratings[1]
[[1]]
[1] 2.516658 2.139329 1.921027 1.900945 1.845772
Using the Recommender function, we have built a random model. Inspecting the model, we
see that this model will produce a random number between -5.9 and 5.2 as defined by the
range. The model does not learn anything, except the range of the ratings in the training set
and produces the requested rating.
The predict function is used to predict the ratings. We need the top N predictions, that is, the top five jokes (drawn from the jokes not already present in the user's known ratings) that the user is predicted to rate highest. This is specified by the n and type parameters. Finally, we can look into our predictions, random.predict, to find the recommendations and ratings for an individual user. In the preceding code snippet, we look at user 1 and see that jokes 81, 96, 47, 28, and 65 are recommended for this user.
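The calls that produce the preceding output are not shown here; a minimal sketch (RANDOM is the registered method name, and the predict arguments mirror the later models):

random.model <- Recommender(train, method = "RANDOM")
random.predict <- predict(random.model, newdata = test.known, n = 5,
                          type = "topNList")
random.predict@items[1]
random.predict@ratings[1]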
It's best practice to build a reference model before we start building our actual models. When we have our first model, we can compare it with the reference model to check for a performance gain. Any good model should be better than a random model. Once we have a good model, we can replace the random model with it as the new reference and proceed to find better models. Random models, since they are very easy to create, are typically used as the first reference model.
Let us build a more sensible model: the popular model we saw in the previous sections.
We build a model of the popular type, trained on the data of the 1,350 training users.
> popular.predict@ratings[1]
$u4027
[1] 3.472511 3.218649 3.157963 3.011908 2.976325
>
Using the predict function with the test.known dataset, we predict the top five recommendations for the users in the test dataset. Remember that, through the given parameter, we had reserved 10 known ratings for each test user, stored in test.known. Using those, we now predict other jokes for each user.
Before we proceed to build a better model, let us introduce the evaluate function.
Until now, we have explicitly invoked Recommender to create a recommender object and then passed that model to the predict function to get our predictions. This is the standard way of implementing a model. However, when we are evaluating different models, we need a convenient way to run and test multiple models.
The evaluate function can take an evaluation scheme and apply it to a given model and
produce the performance of the recommendation model. The performance is given using
multiple metrics.
As you can see, we have passed the plan, which is our evaluation scheme, and the method. We request five recommendations for each user using the n and type parameters.
When comparing ratings, the goodRating parameter is used. Let us say we have set goodRating to 5: if the predicted rating is 5.1 and the actual rating is 7.1, they are considered a match, and our recommendation is counted as agreeing with the test set. This is how the metrics detailed below are calculated.
TP stands for True Positive, the number of instances where our recommendation matched
with the actual recommendation.
FP stands for False Positive, the number of instances where we made a recommendation,
but the actual data shows otherwise.
FN stands for False Negative, where we did not recommend, but the actual data shows
otherwise.
TN stands for True Negative, where we did not recommend and the actual data was in
agreement with us.
Precision is the ratio of correct recommendations to the total number of recommendations made.
Recall is the ratio of correct recommendations to the total number of relevant items (true positives plus false negatives).
TPR stands for True Positive Rate, the ratio of true positives to all positives (true positives plus false negatives).
FPR stands for False Positive Rate, the ratio of false positives to the sum of false positives and true negatives.
Until now, we have used the split scheme; now let us use the n-fold cross-validation scheme. Further, we will only use the evaluate function to find the performance of our recommendation model.
Instead of the train/test split scheme, here we are doing cross-validation. By default, it does a 10-fold cross-validation. Finally, our results are an average over the 10 folds for three different top-N values. As you can see, we have passed n a vector with three values, 5, 10, and 15, so we generate 5, 10, and 15 recommendations.
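This run mirrors the cross-validation call in the chapter-end code listing:

plan <- evaluationScheme(data, method = "cross", train = 0.9, given = 10,
                         goodRating = 5)
results <- evaluate(plan, method = "POPULAR", type = "topNList", n = c(5, 10, 15))
avg(results)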
User-based models
We discussed the user-based model in the previous sections; let us do a quick recap here. The user-based model for collaborative filtering tries to mimic the word-of-mouth approach in marketing. It is a memory-based model. The premise of this algorithm is that similar users will have similar tastes in jokes and will therefore rate jokes in a more or less similar manner. It's a two-step process: in the first step, for a given user, the algorithm finds that user's neighbors, using a similarity measure such as the Pearson coefficient or cosine distance. In the second step, for an item not rated by the user, we look at whether the user's neighbors have rated that item; if they have, an average of the neighbors' ratings is used as the predicted rating for this user.
With our input data prepared, let us proceed to build the model.
1 [0.017sec/0.268sec]
2 [0.016sec/0.267sec]
3 [0.01sec/0.284sec]
4 [0.01sec/0.273sec]
5 [0.009sec/0.273sec]
6 [0.009sec/0.272sec]
7 [0.009sec/0.508sec]
8 [0.009sec/0.236sec]
9 [0.009sec/0.268sec]
10 [0.01sec/0.262sec]
> avg(results)
       TP       FP       FN       TN precision    recall       TPR        FPR
5  2.024000 2.976000 14.40600 70.59400 0.4048000 0.1586955 0.1586955 0.03877853
10 3.838667 6.161333 12.59133 67.40867 0.3838667 0.2888018 0.2888018 0.08048999
15 5.448000 9.552000 10.98200 64.01800 0.3632000 0.3987303 0.3987303 0.12502479
>
Using the framework defined for the previous model, we can create a user-based recommendation system, as shown in the preceding code, and evaluate its performance. We have used the cross-validation scheme to evaluate the model.
We call the evaluate method with our cross-validation scheme and specify the user-based model through the method parameter; UBCF stands for user-based collaborative filtering. Once again, we are interested only in the top N recommendations, and N is now an array of three values: 5, 10, and 15. We want to evaluate our model for all three Ns, so we have passed an array of values. Finally, when we inspect the model performance using the results object, the performance is averaged across the 10 models, for each of the three Ns we supplied.
An alternative way to evaluate the model is to have it predict ratings for all the unknown items in the test data, then compare the predicted and actual ratings and report a metric such as root mean square error (RMSE), mean squared error (MSE), or mean absolute error (MAE).
5 [0.01sec/0.221sec]
6 [0.009sec/0.213sec]
7 [0.013sec/0.247sec]
8 [0.009sec/0.401sec]
9 [0.011sec/0.242sec]
10 [0.009sec/0.243sec]
> avg(results.1)
RMSE MSE MAE
res 4.559954 20.80655 3.573544
>
Here we are saying that, for our user-based model, we need to normalize the data. Since the scale of the ratings is the same across users (from -10 to +10), we only want to bring the data to a zero mean. Further, we want to use the Pearson coefficient as our distance measure, and we want to use 10 neighbors to produce our recommendations.
Also, since we have 100 jokes, we can afford to keep around 30 known jokes per test user.
We can see that our RMSE has gone down. Similarly, we can make further changes and tune our model.
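The run described above mirrors the chapter-end code listing:

param <- list(normalize = "center", method = "Pearson", nn = 10)
plan <- evaluationScheme(data, method = "cross", train = 0.9, given = 30,
                         goodRating = 5)
results.1 <- evaluate(plan, method = "UBCF", param = param, type = "ratings")
avg(results.1)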
Item-based models
This is a model-based approach. From a given ratings matrix, this method explores the relationships between the items. Based on the ratings that different users have provided for different items, an item-to-item similarity matrix is derived. Once again, as in the user-based model, the Pearson coefficient or cosine distance is used as the similarity metric. For efficiency, for each item we store only the top K most similar items rather than all of them. A weighted sum is then used to make a recommendation for a user. Refer to the paper from Amazon for more about item-based filtering: https://dl.acm.org/citation.cfm?id=642471.
As you can see, the class structure of recommenderlab is very elegant; the only change we made was to use IBCF as the value of the method parameter.
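The chapter-end listing evaluates IBCF as a top-N list; the RMSE figures discussed next would come from a ratings-type run, roughly like the following sketch (the object name results.ibcf is illustrative):

results.ibcf <- evaluate(plan, method = "IBCF", type = "ratings")
avg(results.ibcf)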
The RMSE is higher than that of the user-based model, which is not surprising: item-based models typically perform slightly worse than user-based models. The trade-off is memory usage; the user-based model consumes more memory but gives better results.
Factor-based models
Factor-based models leverage matrix decomposition techniques to predict ratings for unrated items.
Once again, the steps are the same, except for the value of the method parameter.
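A sketch of this run; "SVDF" is assumed to be the registered recommenderlab method name for the Funk SVD model used here:

results.svd <- evaluate(plan, method = "SVDF", type = "ratings")
avg(results.svd)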
3 [33.355sec/1.456sec]
4 [34.085sec/1.505sec]
5 [33.442sec/1.356sec]
6 [34.764sec/1.328sec]
7 [33.307sec/1.232sec]
8 [34.139sec/1.484sec]
9 [33.132sec/1.359sec]
10 [35.274sec/1.342sec]
> avg(results.1)
RMSE MSE MAE
res 5.086455 25.88602 3.853208
>
Our RMSE is much higher than for both the item-based and the user-based models.
Let us now build a complete model, using our knowledge from the three experiments above.
To begin with, let us normalize the data and then split it into train and test sets:
data <- Jester5k
data <- normalize(data, method = "center")
plan <- evaluationScheme(data, method="split", train=0.9, given = 10,
goodRating=5)
train <- getData(plan, "train")
test <- getData(plan, "unknown")
test.known <- getData(plan, "known")
We saw that Pearson similarity with 10 nearest neighbors worked well, so we will use those settings to build our final model.
[1] 66 36 35 54 42
> final.predict@ratings[1]
$u7147
[1] 2.746771 2.742437 2.722453 2.297228 1.974920
In this section, we have shown how to build all three types of recommender systems. We finally chose the user-based system, as it outperformed the other systems on RMSE.
The next section gives the complete code we have used in this chapter.
Complete R Code
The complete project code is as follows:
set.seed(100)
products <- c('A','B','C','D','E','F','G')
user.a <- c( 3, 0, 2, 5, 5, 0,1)
user.b <- c( 3, 5, 3, 5, 4, 2, 1)
install.packages("recommenderlab")
# A quick recommender
model <- Recommender(data = rating.mat, method = "POPULAR")
model
str(model)
recommenderRegistry$get_entries(dataType = "binaryRatingMatrix")
# Quick Prediction
recomms <- predict(model, newdata = rating.mat, n =2)
recomms@items
recomms@ratings
data("Jester5k")
str(Jester5k)
head(Jester5k@data[1:5,1:5])
# Users
Jester5k@data[1,]
Jester5k@data[100,]
# Ratings
Jester5k@data[,1]
par(mfrow=c(2,2))
joke.density <- density(Jester5k@data[,1][Jester5k@data[,1]!=0])
plot(joke.density)
par(mfrow=c(1,1))
nratings(Jester5k)
hist(getRatings(Jester5k), main="Distribution of ratings")
# Popular joke
# Binarize the ratings
ratings.binary <- binarize(Jester5k, minRating =0)
ratings.binary
ratings.sum <- colSums(ratings.binary)
ratings.sum.df <- data.frame(joke = names(ratings.sum), pratings =
ratings.sum)
head( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
tail( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
# Sample
data <- sample(Jester5k, 1500)
hist(getRatings(data), main="Distribution of ratings for 1500 users")
##
data <- sample(Jester5k, 1500)
ratings.mat <- getRatingMatrix(data)
str(ratings.mat)
str(data@data)
# Normalization (assumed reconstruction; creates the objects plotted below)
data.norm <- normalize(data, method = "center")
data.norm.z <- normalize(data, method = "Z-score")
par(mfrow=c(3,1))
plot(density(getRatings(data)), main = "Raw")
plot(density(getRatings(data.norm)), main = "Normalized")
plot(density(getRatings(data.norm.z)), main = "Z-score normalized")
par(mfrow=c(1,1))
# Popular Model
popular.model <- Recommender(train, "POPULAR")
popular.model
test.known@data[1,c(50,36,27,32,35)]
# Cross validation
plan <- evaluationScheme(data, method="cross", train=0.9, given = 10,
goodRating=5)
results <- evaluate(plan, method = "POPULAR", type = "topNList", n =
c(5,10,15) )
avg(results)
param=list(normalize="center",method="Pearson",nn=10)
plan <- evaluationScheme(data, method="cross", train=0.9, given = 30,
goodRating=5)
results.1 <- evaluate(plan, method ="UBCF", param=param, type ="ratings")
avg(results.1)
# assumed reconstruction: the IBCF run reuses a cross-validation scheme like the earlier ones
plan <- evaluationScheme(data, method="cross", train=0.9, given = 10,
goodRating=5)
results <- evaluate(plan, method = "IBCF", type = "topNList", n =
c(5,10,15) )
avg(results)
# final model
data <- Jester5k
data <- normalize(data, method = "center")
plan <- evaluationScheme(data, method="split", train=0.9, given = 10,
goodRating=5)
train <- getData(plan, "train")
test <- getData(plan, "unknown")
test.known <- getData(plan, "known")
param=list(method="Pearson",nn=10)
final.model <- Recommender(train, method = "UBCF", param = param)
final.model
test@data[1,]
test@data[1,c(45,80,10,25,77)]
Summary
We looked at what is a collaborative filtering method and went into different collaborative
filtering strategies. We introduced the R package recommenderlab to perform
collaborative filtering. We leveraged the Jester5k dataset to demonstrate a collaborative
filtering algorithm. We looked at the random model, popular model, item-based similarity,
user-based similarity models, and factor models. We introduced the concept of evaluating
the performance of a recommender system before deploying it. We demonstrated the steps
to split the datasets in order to evaluate our model performance.
In the next chapter, we will be looking to build Deep Neural networks using the MXNet
package for time series data. We will introduce the package MXNet R and proceed to build a
deep neural network.
Chapter 4: Taming Time Series Data Using Deep Neural Networks
With the advent of the Internet of Things, more and more devices are being added to the World Wide Web every day. Today's world is rapidly becoming inundated with intelligent devices, which are producing data at a pace we have never seen before. Storage costs have come down significantly, allowing companies to store this massive amount of data cheaply. Smart City projects have been kick-started by various governments, and they rely heavily on this instrumentation. Technology companies and governments now want to leverage the massive data generated by these smart devices to perform analytics and produce data-enabled applications. The end use cases of these analytics are limitless. The data is characterized by three factors: volume, velocity, and variety.
Among all the different varieties of data, data from sensors is the most widespread and is
referred to as time series data.
A time series is a series of data points indexed (or listed or graphed) in time order.
Most commonly, a time series is a sequence taken at successive equally spaced
points in time. Thus it is a sequence of discrete-time data.
Time series data is not a new type of data. For decades, financial industries have been using
this data for various market-related purposes. Algorithmic trading is the use of computer
programs to generate and follow a defined set of instructions to trade in the market. This
allows profit to be generated at a greater speed and frequency compared to a human trader.
Algorithmic trading leverages time series data. The traditional approach to time series data
involves analyzing it in frequency domains or temporal domains. Today, deep learning has
opened up a whole new door to analyzing time series data at a never before seen speed and
level of precision.
Time series data is not just limited to sensor data and financial data. In many other real-
world applications, such as speech recognition, machine translation, and sequence
generation, data is captured in a temporal fashion. Two points in a time series may look
identical, but they may have different temporal dependencies, thus making them belong to
two different categories.
Deep learning allows us to build sophisticated models, which can capture non-linear
relationships in the data in a much more efficient and faster manner. Deep learning can be
defined in two ways.
Any neural network with more than three hidden layers can be defined as a deep learning
network. Different layers in the network capture different properties of the data. The nature
of deep learning is very valuable to time series analysis.
The other definition is the series of hacks performed on these networks to incorporate
bagging and boosting functionalities, thus allowing us to build an ensemble of models
through a single network. Once again, each model in the ensemble is now capable of
understanding different regions of the data. This is very useful for analyzing time series
data.
The aim of this chapter is to introduce MXNet R to our readers and show them how fully
connected deep networks can be effective in predicting time series data. Our treatment of
traditional time series data analysis will be very limited. We will provide sufficient
references for traditional time series analysis.
The code for this chapter was written in RStudio version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the MXNet R package that we will be
using. During our code descriptions, we will be using some of the output printed in the
console. We have included what was printed in the console immediately following the
statement that prints the information to the console, to not disturb the flow of the code.
Using the scan function, we get the data from the URL. Following that, we create a time
series object using ts.
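The loading code is not shown here; a minimal sketch, assuming the kings-of-England dataset from the time series references cited later in this chapter (the URL and the skip value are assumptions):

kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip = 3)
king.ts <- ts(kings)
plot(king.ts)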
The standard R plot function knows how to plot time series data. The time series plot is as
shown in the following diagram:
An additive model is used to estimate the trend of a non-seasonal time series. The time
series can be smoothed to remove the trend using methods such as moving averages. The
trend can be either increasing or decreasing.
library(TTR)   # provides the SMA function used below
par(mfrow=c(2,2))
plot(SMA(king.ts, n=2), main = "n=2")
plot(SMA(king.ts, n=5), main = "n=5")
plot(SMA(king.ts, n=10), main = "n=10")
plot(SMA(king.ts, n=15), main = "n=15")
par(mfrow=c(1,1))
The TTR library provides the SMA function to perform the smoothing of the data. The n
parameter defines the order of the moving average, which means how many points we
want to consider to do our moving average.
We will use another small dataset to show seasonality—the number of births per month in
New York City from January 1946 to December 1959:
> births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
Read 168 items
> births.ts <- ts(births, frequency = 12)
> births.comps <- decompose(births.ts)
Using the decompose function, we split the trend, seasonality, and error from the time
series. Let us now plot these components using plot(births.comps):
We see the individual components, trends, seasonality, and errors in the graph.
In the previous section, we looked at non-seasonal time series with trend and error
components, and seasonal time series with an additional seasonal component. These are
non-stationary aspects of the time series. The time series has to be made stationary before
we use any regression methods.
Statistical stationarity
A stationary time series is one whose statistical properties, such as mean,
variance, autocorrelation, and so on, are all constant over time.
Let us do a simple regression exercise with our time series data. We are going to predict the
age of death of a king based on the age of death of the previous king. It's a simple example
and hence may sound very strange. The idea is to show how regression is used to do time
series forecasting.
We are using the quantmod package to get the lag of the time series. Our predictor is the
age of death of the previous king. Hence, we take a lag of 1. Finally, our data
frame, new.data, holds our x and y.
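A minimal sketch of this preparation step and the regression whose output follows (the exact original code may differ):

library(quantmod)
king.zoo <- as.zoo(king.ts)
new.data <- na.omit(data.frame(Lag(king.zoo, 1), y = king.zoo))  # Lag.1 = previous king's age
head(new.data)
model <- lm(y ~ Lag.1, data = new.data)
model
par(mfrow = c(2, 2))
plot(model)    # residual diagnostic plots discussed below
par(mfrow = c(1, 1))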
Call:
lm(formula = y ~ Lag.1, data = new.data)
Coefficients:
(Intercept) Lag.1
2.8651 0.9548
>
This plot shows whether or not the residuals are normally distributed. The residuals follow
a straight line.
This plot shows whether or not the residuals have non-linear patterns. We can see equally
spread residuals around the red horizontal line, without distinct patterns, which is a good
indication that we don't have non-linear relationships.
We will stop this section here. Hopefully, it gave you an idea about how to rearrange time
series data to suit a regression problem. Some of the techniques shown here, such as lag,
will be used later when we build our deep learning model.
For more about time series regression, refer to the book Forecasting: Principles and Practice at https://www.otexts.org/fpp/4/8, which introduces some basic time series definitions. More curious readers can refer to A Little Book of R for Time Series at https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html.
This network architecture has a single hidden layer with two nodes. The output layer is
activated by a softmax function. This network is built for a classification task. The hidden
layer can be activated by tanh, sigmoid, relu, or soft relu activation functions. The
activation function performs the key role of introducing non-linearity in the network.
The product of weight and the input from the previous layer summed up with the bias is
passed to these activation functions.
Compared to the multi-layer perceptron, we can see that there are several hidden layers.
For the sake of clarity in the diagram, we have only two nodes in each hidden layer, but in
practice there may be several hundred nodes.
As you can see, the deep learning networks are different from their sibling single-hidden-
layer neural networks in terms of their depth, that is, the number of hidden layers. The
input data is now passed through multiple hidden layers, on the assumption that each layer
will learn some aspects of the input data.
So when do we call a network deep? Networks with more than three layers (including the input and output layers) can be called deep neural networks.
As previously mentioned, in deep learning networks the nodes of each hidden layer learn to
identify a distinct set of features based on the previous layer’s output. Subsequent hidden
layers can understand more complex features as we aggregate and recombine features from
the previous layer:
Let us see how intuitively a feedforward neural network learns from the data.
Say we are solving a classification task. Our input X is a 2 x 2 matrix and our output Y is a
vector of the binary response, either one or zero.
The two nodes in our input layer are fully connected to the two nodes in the hidden layer.
We have four weights going from our input node to our hidden node. We can represent our
weights using a 2 x 2 matrix. We have bias across the two nodes in our hidden layer. We
can represent our bias using a 1 x 2 matrix.
Again, we connect our hidden nodes fully to the two nodes of our output layer. The weight
matrix is again 2 x 2 and the bias matrix is again 1 x 2.
{w1,w2,w3,w4} and {b1,b2} are the weights and bias for the input to hidden nodes.
{v1,v2,v3,v4} and {bb1,bb2} are the weights and bias for the hidden to output nodes.
It is very important to initialize the weights and biases randomly. Xavier initialization is a popular initialization scheme widely used today. See this blog post for a very intuitive explanation of Xavier initialization: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization.
Forward cycle
The input x1, x2 is multiplied by the weights w1..w4 in each node and the respective bias is
added. This is the linear transformation:
Now these activations are multiplied by v1..v4 and the respective bias is added:
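A small numeric sketch of the forward cycle (the choice of sigmoid for the hidden layer and the mapping of w1..w4 and v1..v4 to specific connections are assumptions made for illustration):

sigmoid <- function(z) 1 / (1 + exp(-z))
softmax <- function(z) exp(z) / sum(exp(z))

x <- c(0.5, 1.0)                       # x1, x2
W <- matrix(c(0.1, 0.3,                # w1, w2 (into hidden node 1)
              0.2, 0.4),               # w3, w4 (into hidden node 2)
            nrow = 2, byrow = TRUE)
b <- c(0.05, -0.05)                    # b1, b2
h <- sigmoid(W %*% x + b)              # hidden activations

V <- matrix(c(0.5, -0.5,               # v1, v2
              0.3, -0.3),              # v3, v4
            nrow = 2, byrow = TRUE)
bb <- c(0.0, 0.1)                      # bb1, bb2
softmax(V %*% h + bb)                  # output probabilities (sum to 1)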
Backward cycle
The training of the network happens in the backward cycle, hence the name backpropagation algorithm. The difference between the actual and predicted values, the error, is calculated. This error is propagated back through the model and the weights are adjusted: the derivative of the error with respect to each weight tells us how much that weight contributed to the overall error, and the weight is adjusted by a corresponding amount. Refer to https://en.wikipedia.org/wiki/Backpropagation to learn more.
Hopefully, this gave you an intuitive explanation on how neural networks learn from data.
Let us go to the next section to learn more about the MXNet R library we will be using to
build our deep learning network.
In MXNet, NDArray is the basic operation unit: a vectorized unit for matrix and tensor computations. All operations on NDArrays can be run on either the CPU or the GPU and, most importantly, they run in parallel. NDArray is the basic data structure for manipulating data in MXNet. It supports a wide range of mathematical operations, allowing machine learning programs to be written in an imperative fashion. Let us look at some basic NDArray operations.
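A minimal sketch of the operations referred to in the next few paragraphs:

library(mxnet)
zero.matrix <- mx.nd.zeros(c(3, 3))    # 3 x 3 NDArray of zeros
ones.matrix <- mx.nd.ones(c(3, 3))     # 3 x 3 NDArray of ones
zero.matrix
ones.matrix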
We have created two matrices of size 3 x 3, one filled with zeros and the other filled with
ones.
The context defines where the code is executed, either in the CPU or GPU. Let us look at the
context:
> mx.ctx.default()
$device
[1] "cpu"
$device_id
[1] 0
$device_typeid
[1] 1
attr(,"class")
[1] "MXContext"
As my machine does not include a GPU, the only context available is CPU.
We pass the context while creating the arrays. Here, we have passed the GPU context,
specifying that we need this 3 x 3 zero filled array to be created in the GPU.
The code demonstrates matrix multiplication and element-wise addition in the matrix.
The code generates data from a normal distribution for a given mean and standard
deviation.
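Minimal sketches of the three snippets described above; mx.gpu() assumes a GPU-enabled build, and the mx.rnorm arguments shown are assumptions:

zeros.gpu <- mx.nd.zeros(c(3, 3), ctx = mx.gpu(0))    # created on the GPU

m1 <- ones.matrix
m2 <- ones.matrix
m1 + m2               # element-wise addition
mx.nd.dot(m1, m2)     # matrix multiplication

mx.set.seed(0)
mx.rnorm(c(3, 3), mean = 2, sd = 1)    # normally distributed NDArray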
$b
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> pexec$outputs
$`_plus4_output`
[,1] [,2] [,3]
[1,] 2 2 2
[2,] 2 2 2
[3,] 2 2 2
We create an executor by calling mx.simple.bind for our symbol c. In order to create it,
we need to tell the executor the shape of a and b, which are arguments to c. After that,
using mx.exec.update.arg.arrays, we push the real data into the executor for it to
execute the symbol c. The output slot of the executor has the results stored.
Let us look at another example, where we create a symbol d, which is a dot product of two
matrices:
> d <- mx.symbol.dot(a, b)
> arg_lst <- list(symbol = d, ctx = mx.ctx.default(), a = dim(ones.matrix),
+ b= dim(ones.matrix), grad.req="null")
> pexec <- do.call(mx.simple.bind, arg_lst)
> pexec
C++ object <0x1170550a0> of class 'MXExecutor' <0x101be9c30>
> input_list <- list(a = ones.matrix,b = ones.matrix)
> mx.exec.update.arg.arrays(pexec, input_list)
> mx.exec.forward(pexec)
> pexec$arg.arrays
$a
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
$b
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> pexec$outputs
$dot3_output
[,1] [,2] [,3]
[1,] 3 3 3
[2,] 3 3 3
[3,] 3 3 3
Refer to https://github.com/apache/incubator-mxnet/tree/master/R-package/vignettes for R vignettes and other operations using MXNet.
With this background in imperative and symbolic use of MXNet R, let us go ahead and build
a simple multi-layer perceptron to solve the famous XOR gate problem. We want the
network to learn the XOR truth table:
X Y Z
0 0 0
0 1 1
1 0 1
1 1 0
Our network architecture is as follows:
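The data-generation step that precedes the following lines draws points around the four XOR corners; a minimal sketch (the sample size, spread, and use of base rnorm rather than MXNet's samplers are assumptions):

set.seed(100)
n <- 50
x.1 <- mx.nd.array(cbind(rnorm(n, 0, 0.1), rnorm(n, 0, 0.1)))   # around (0, 0) -> label 0
x.2 <- mx.nd.array(cbind(rnorm(n, 0, 0.1), rnorm(n, 1, 0.1)))   # around (0, 1) -> label 1
x.3 <- mx.nd.array(cbind(rnorm(n, 1, 0.1), rnorm(n, 0, 0.1)))   # around (1, 0) -> label 1
x.4 <- mx.nd.array(cbind(rnorm(n, 1, 0.1), rnorm(n, 1, 0.1)))   # around (1, 1) -> label 0
y.1 <- rep(0, n); y.2 <- rep(1, n); y.3 <- rep(1, n); y.4 <- rep(0, n)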
X <- data.matrix(mx.nd.concat(list(x.1,x.2,x.3,x.4)) )
Y <- c(y.1,y.2,y.3,y.4)
We have used four normal distributions to generate our data points. Finally, we combine
them into one array, X. The labels are stored in the Y variable.
Now let us go ahead and define our network architecture to solve our XOR problem:
############## Define the Network #########
# Input layer
data <- mx.symbol.Variable("data")
# Hidden layer with two nodes
hidden.layer <- mx.symbol.FullyConnected(data = data, num_hidden = 2)
# ReLU activation tying the hidden layer to the output layer (assumed line; described in the text below)
act <- mx.symbol.Activation(data = hidden.layer, act_type = "relu")
# Output layer
out.layer <- mx.symbol.FullyConnected(data = act, num_hidden = 2)
# Softmax of output
out <- mx.symbol.SoftmaxOutput(out.layer)
We use relu activation for this layer. mx.symbol.Activation is used to create the activation, and we pass the previous hidden layer to it to indicate that this activation function is tied to that layer.
Softmax activation
Softmax activation is most frequently used for classification tasks. Softmax rescales the
output from the previous layer. First, it calculates the exponential of the input to its neuron
and then it divides the total sum of input with all of the neurons in the layer, so that the
activation sums up to one and lies between zero and one.
Softmax activation is used as the last layer in a deep learning neural network for multi-class
classification. The layer has the same number of nodes as the number of classes and it
rescales the output so that it adds up to 1.0, therefore calculating the probability for each
class, that is P(y|x).
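The call that trains this network is on a page not shown here; the truncated log that follows comes from a call along these lines (a sketch: the batch size, learning rate, and momentum are assumptions, while the 20 rounds match the output):
mx.set.seed(100)
model <- mx.model.FeedForward.create(out
                                     , X = X, y = Y
                                     , ctx = mx.ctx.default()
                                     , num.round = 20
                                     , array.batch.size = 10
                                     , learning.rate = 0.1
                                     , momentum = 0.9
                                     , array.layout = "rowmajor"
                                     , eval.metric = mx.metric.accuracy)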
[14] Train-accuracy=0.912
[15] Train-accuracy=1
[16] Train-accuracy=1
[17] Train-accuracy=1
[18] Train-accuracy=1
[19] Train-accuracy=1
[20] Train-accuracy=1
> X_test = data.matrix(rbind(c(0,0),c(1,1),c(1,0),c(0,1) ) )
> preds = predict(model, X_test, array.layout = "rowmajor")
> pred.label <- max.col(t(preds)) -1
> pred.label
[1] 0 0 1 1
Finally, we pass the model to the predict function to see the predictions. We have finished building a neural network with 100% accuracy on our training data, and its predictions reproduce the XOR truth table. We can further investigate our model by looking at the graph generated by our symbolic programming:
> graph.viz(model$symbol)
> model$arg.params
$fullyconnected4_weight
[,1] [,2]
[1,] -1.773083 1.751193
[2,] -1.774175 1.751324
$fullyconnected4_bias
[1] 1.769027 -1.754247
$fullyconnected5_weight
[,1] [,2]
[1,] 2.171195 -2.166867
[2,] 2.145441 -2.132973
$fullyconnected5_bias
[1] -1.662504 1.662504
>
Furthermore, we can look into the weights our model has generated for this network. Hopefully, this section gave you a good overview of the MXNet R package. Using this knowledge, let us go ahead and solve our time series problem.
We leverage the package quantmod. The getSymbols function in quantmod can fetch stock
information from sources such as Yahoo and Google. As you can see, we are fetching
Apple's stock price data using their AAPL ticker.
We are interested only in the close price. We take the closing price out and peek at the top
rows.
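A sketch of those two steps (the book's exact call is on a page not shown here; the date range is whatever the source returns by default):
library(quantmod)
getSymbols("AAPL", src = "yahoo")   # creates an xts object named AAPL in the workspace
data <- Cl(AAPL)                    # keep only the closing price
head(data)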
We pull the data into a data frame called plot.data, so as to pass this conveniently to
ggplot.
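A minimal sketch of that step (the column names used here are assumptions):
library(ggplot2)
plot.data <- data.frame(date = index(data), close = as.numeric(data))
ggplot(plot.data, aes(x = date, y = close)) + geom_line()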
The graph shows the time series of the stock prices. It's a non-stationary time series with a definite trend and error component.
The closing price from day 1 in our dataset, up to day 31, will be our X. The day
32 closing price will be our Y.
The closing price from day 2 in our dataset, up to day 32, will be our X. The
closing price on day 33 will be our Y.
We will follow the same pattern. Once again, we will leverage the quantmod package's Lag function to prepare our data:
> data.ts <- as.ts(data)
> data.zoo <- as.zoo(data.ts)
> x.data <- list()
> for (j in 1:31){
+ var.name <- paste("x.lag.",j)
+ x.data[[var.name]] <- Lag(data.zoo,j)
+ }
> final.data <- na.omit(data.frame(x.data, Y = data.zoo))
> head(final.data)
Lag.1 Lag.2 Lag.3 Lag.4 Lag.5 Lag.6 Lag.7 Lag.8
Lag.9 Lag.10 Lag.11
32 4.250000 4.136161 3.883929 4.053571 4.022321 4.102679 4.073661 3.857143
3.689732 3.529018 3.580357
33 4.075893 4.250000 4.136161 3.883929 4.053571 4.022321 4.102679 4.073661
3.857143 3.689732 3.529018
34 4.102679 4.075893 4.250000 4.136161 3.883929 4.053571 4.022321 4.102679
4.073661 3.857143 3.689732
35 3.973214 4.102679 4.075893 4.250000 4.136161 3.883929 4.053571 4.022321
4.102679 4.073661 3.857143
36 4.064732 3.973214 4.102679 4.075893 4.250000 4.136161 3.883929 4.053571
4.022321 4.102679 4.073661
37 4.151786 4.064732 3.973214 4.102679 4.075893 4.250000 4.136161 3.883929
4.053571 4.022321 4.102679
Lag.12 Lag.13 Lag.14 Lag.15 Lag.16 Lag.17 Lag.18 Lag.19
Lag.20 Lag.21 Lag.22
32 3.705357 3.629464 3.928571 3.935268 4.008929 3.794643 3.975446 4.053571
3.805804 3.712054 3.587054
33 3.580357 3.705357 3.629464 3.928571 3.935268 4.008929 3.794643 3.975446
4.053571 3.805804 3.712054
34 3.529018 3.580357 3.705357 3.629464 3.928571 3.935268 4.008929 3.794643
3.975446 4.053571 3.805804
35 3.689732 3.529018 3.580357 3.705357 3.629464 3.928571 3.935268 4.008929
3.794643 3.975446 4.053571
36 3.857143 3.689732 3.529018 3.580357 3.705357 3.629464 3.928571 3.935268
4.008929 3.794643 3.975446
37 4.073661 3.857143 3.689732 3.529018 3.580357 3.705357 3.629464 3.928571
3.935268 4.008929 3.794643
Lag.23 Lag.24 Lag.25 Lag.26 Lag.27 Lag.28 Lag.29 Lag.30
Lag.31 Y
We start by creating an empty x.data list. Inside the for loop we invoke the Lag function
to get the last 31 closing prices.
We combine this list with our original data and omit all the rows with NA values. Our dataset is now ready. train.data contains 80% of our data and test.data contains the final 20%.
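The split itself can be written as follows (the train.perc and train.indx lines appear in the Complete R code section; the explicit train.data and test.data assignments are a sketch):
train.perc <- 0.8
train.indx <- 1:as.integer(dim(final.data)[1] * train.perc)
train.data <- final.data[train.indx, ]
test.data <- final.data[-train.indx, ]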
Let us now split the data into the dependent variable, y, and independent variable, x:
train.x.data <- data.matrix(train.data[, -which(names(train.data) == "Y")])  # Y is the last column of final.data
train.y.data <- train.data[, "Y"]
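The deep network for the regression is built with mx.mlp. The call below is reproduced from the Complete R code section at the end of this chapter, since the page that introduces it is not shown here; it produces the (truncated) training log that follows:
mx.set.seed(100)
deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
                     hidden_node = c(1000, 500, 250)
                     , out_node = 1
                     , dropout = 0.50
                     , activation = c("relu", "relu", "relu")
                     , out_activation = "rmse"
                     , array.layout = "rowmajor"
                     , learning.rate = 0.01
                     , array.batch.size = 100
                     , num.round = 100
                     , verbose = TRUE
                     , optimizer = "adam"
                     , eval.metric = mx.metric.mae)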
[25] Train-mae=3.91980588843425
[26] Train-mae=3.86194357938237
[27] Train-mae=4.2168276017242
[28] Train-mae=3.97845429400603
[29] Train-mae=3.76002277152406
[30] Train-mae=3.83593577626679
[31] Train-mae=3.71273083243105
[32] Train-mae=3.6195217209061
[33] Train-mae=3.87637294858694
[34] Train-mae=3.79551469782988
[35] Train-mae=3.66272431297435
[36] Train-mae=3.81557290663322
[37] Train-mae=3.73366007523404
[38] Train-mae=3.57986994018157
[39] Train-mae=3.48014950540331
[40] Train-mae=3.70377947661612
[41] Train-mae=3.57618864927027
[42] Train-mae=3.52039705213573
[43] Train-mae=3.55409083919393
[44] Train-mae=3.64251783867677
[45] Train-mae=3.45834989514616
[46] Train-mae=4.0749755111999
[47] Train-mae=3.71578220281336
[48] Train-mae=3.50679727282789
[49] Train-mae=4.14914410233498
[50] Train-mae=3.7334080057674
[51] Train-mae=3.70052516341209
[52] Train-mae=3.91092829631435
[53] Train-mae=3.67670645170742
[54] Train-mae=3.63808277865251
[55] Train-mae=3.6711657425099
[56] Train-mae=3.45309603734149
[57] Train-mae=3.6100285096301
[58] Train-mae=3.47404639359977
[59] Train-mae=3.48650643053982
[60] Train-mae=3.44442329986228
[61] Train-mae=3.76036083085669
[62] Train-mae=3.61709513925844
[63] Train-mae=3.50891292694542
[64] Train-mae=3.74455126540528
[65] Train-mae=3.58543863587909
[66] Train-mae=3.46644685930676
[67] Train-mae=3.54323410050737
[68] Train-mae=3.64015973226892
[69] Train-mae=3.75287163681454
[70] Train-mae=3.48837021980021
[71] Train-mae=3.57296027570963
[72] Train-mae=3.77685429851214
[73] Train-mae=3.63230897545815
[74] Train-mae=3.46640953759352
[75] Train-mae=3.42612222330438
[76] Train-mae=3.59378631307019
[77] Train-mae=3.71207889818483
[78] Train-mae=3.67349873264631
[79] Train-mae=3.91540269825194
[80] Train-mae=3.75410354889101
[81] Train-mae=3.74548216611147
[82] Train-mae=3.48044820434517
[83] Train-mae=3.45057798130645
[84] Train-mae=3.56516027308173
[85] Train-mae=3.82901276187764
[86] Train-mae=3.57472546464867
[87] Train-mae=3.5186486049162
[88] Train-mae=3.48254795942042
[89] Train-mae=3.5777530745003
[90] Train-mae=3.46502910438511
[91] Train-mae=3.50481863598029
[92] Train-mae=3.94887298007806
[93] Train-mae=3.85228754586644
[94] Train-mae=3.39591218464904
[95] Train-mae=3.32976617760129
[96] Train-mae=3.46257649577326
[97] Train-mae=3.74387675043609
[98] Train-mae=3.49124042812321
[99] Train-mae=3.49357031944725
[100] Train-mae=3.66018298261695
For more details about the parameters, please refer to the MXNet R documentation.
Let us now write a small function to evaluate the performance of our model:
> model.evaluate <- function(deep.model, new.data, actual){
+ preds = predict(deep.model, new.data, array.layout = "rowmajor")
+ error <- actual - preds
+ return(mean(abs(error)))
+
+ }
>
The model.evaluate function takes the model, predicts the output, and compares the
output prediction with the actual values. It calculates and returns the mean absolute error.
We have a mean absolute error of 0.66 on our training data and 1.478 on the test data. This is the average of the absolute difference between the actual stock closing price and our prediction. It's good to get such a low MAE. A plot of our predictions should show how well this model fits.
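The construction of this plot.data frame is not shown on the preceding pages; a sketch of it:
train.preds <- predict(deep.model, train.x.data, array.layout = "rowmajor")
plot.data <- data.frame(actual = train.y.data,
                        predicted = as.numeric(train.preds))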
Using the plot.data frame, let us plot the actual and predicted values:
ggplot(plot.data,aes(x = seq_along(actual))) + geom_line(aes(y = actual,
color = "Actual")) + geom_line(aes(y = predicted, color = "Predicted"))
Not bad, our model fits the training data well.
Once again, we pack the actual and predicted values into a plot.data frame for ease of plotting. Let us look at the prediction plot:
We can see good performance on our test data too.
We can look into the weights learned in each of the layers. Let us look at the weights of the
first hidden layer:
> deep.model$arg.params$fullyconnected18_weight
[,1] [,2] [,3] [,4] [,5]
[,6] [,7]
[1,] -0.0083999438 8.485802e-04 0.0036450755 -0.05046703 8.365825e-03
8.337282e-03 -0.0087173656
[2,] 0.0013683429 -6.198279e-03 -0.0083191348 -0.05300714 -3.504494e-03
-5.474154e-03 0.0020545013
[,8] [,9] [,10] [,11] [,12]
[,13] [,14] [,15]
[1,] -6.238837e-04 -0.02354868 -0.02122656 0.0090157799 -0.02873407
-0.05015010 -0.06675677 -0.02245500
[2,] -2.801131e-03 -0.01479008 -0.01589586 -0.0063846195 -0.02265893
-0.05025507 -0.06229784 -0.02008835
[,16] [,17] [,18] [,19] [,20]
[,21] [,22] [,23]
[1,] -0.09068959 0.0078839101 -0.05393362 -0.0087832464 -0.02230835
-0.02040787 -0.08998983 -6.150414e-03
[2,] -0.09574232 0.0016337502 -0.06588018 -0.0046719122 -0.02494662
-0.03023234 -0.07455231 -1.322337e-03
[,24] [,25] [,26] [,27] [,28]
[,29] [,30] [,31]
[1,] -0.1390225 -0.01965304 -0.02212325 -0.01884965 -0.0020784298
-0.02215839 -0.06166138 -0.06153671
[2,] -0.1502734 -0.01914297 -0.02417008 -0.02372178 -0.0086792232
-0.02916201 -0.06242901 -0.06657734
[,32] [,33] [,34] [,35] [,36]
[,37] [,38] [,39]
[1,] -9.309854e-03 -0.06228027 -0.08344175 -0.05109306 -0.0040911064
-0.09716180 -5.205772e-03 -0.04241674
[2,] -8.990907e-03 -0.05343428 -0.08138016 -0.05199971 -0.0090623405
-0.08794980 -1.023613e-03 -0.05040253
[,40] [,41] [,42] [,43] [,44]
[,45] [,46] [,47]
[1,] -0.05714688 -0.02148283 -0.06252886 -0.06428687 -0.06470018
-8.545434e-04 -0.05015615 -0.06178543
[2,] -0.06344762 -0.01426465 -0.06747299 -0.06294472 -0.07089290
-2.695541e-03 -0.05909692 -0.06171135
[,48] [,49] [,50] [,51] [,52]
[,53] [,54]
[1,] -0.02646469 -6.758149e-03 -0.0091832494 -0.0094310865 -0.03371038
-0.05042901 -0.1533335
[2,] -0.03239496 4.416389e-03 0.0053570746 -0.0086173108 -0.01760352
-0.05314169 -0.1448236
We have truncated the output. Similarly, other weights can be viewed. Note that your
weights may not be the same as the ones printed here.
Sometimes, looking at the learned weights of a neural network can provide insight into the
learning behavior. For example, if weights look unstructured, maybe some were not used at
all, or if very large coefficients exist, maybe regularization was too low or the learning rate
too high.
According to the diagram, the x axis corresponds to the row number and the y axis to the
column number, with column 1 at the bottom.
This one has some small sparks spread about. It's better than the previous ones.
Finally, let us talk about tuning the model. There are several knobs to tune, such as the number of hidden layers, the number of nodes per layer, the activation functions, the dropout rate, the learning rate, the batch size, and the number of training rounds.
A random search of the parameter space is typically suggested for finding good settings; grid search is not recommended for deep learning networks.
We provide a function here which can be used to perform a random search across different
parameters:
random.search <- function(){
# Sample layers
count.layers <- sample(2:5, 1)
no.layers <- c()
activations <- c()
for (i in 1:(count.layers - 1)){
# Sample nodes per layer
no.nodes <- sample(10:50,1)
no.layers[i] <- no.nodes
activations[i] <- "relu"
}
no.layers <- append(no.layers, 1)
activations <- append(activations, "relu")
deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
hidden_node = no.layers
, out_node = 1
, dropout = 0.50
, activation = activations
,out_activation = "rmse"
, array.layout = "rowmajor"
, learning.rate = 0.01
, array.batch.size = 100
, num.round = 10
, verbose = TRUE
, optimizer = "adam"
, eval.metric = mx.metric.mae
)
train.error <- model.evaluate(deep.model,train.x.data, train.y.data)
test.error <- model.evaluate(deep.model,test.x.data, test.y.data)
output <- list(layers = no.layers, activations <- activations
, train.error = train.error, test.error = test.error)
return(output)
}
final.output = list()
for (i in 1:2){
out <- random.search()
final.output[[i]] <- out
}
You can go ahead and modify the code to search over other parameters too, such as the dropout rate and the learning rate, to see how the models evolve.
Before we conclude, let us have some discussion around the suitability of deep learning
networks for time series analysis. Earlier, when we discussed time series, we referred to the
trend, seasonality, and error components in time series data. These lead to the non-
stationary aspect of time series data. We suggested that we should remove the seasonality
and the trend component from the time series data to use them for prediction in regression
models. When it comes to using a neural network for time series predictions, there are
several options available:
Data Preprocessing:
1. Remove the trend and seasonality and use the non-stationary data for
prediction.
2. Arrange the data in a way that includes the seasonality correlation in
the input fed to the network.
3. Use a scheme of rolling windows as shown in this chapter.
Network:
1. Deep neural networks formed by multiple hidden layers.
2. Recurrent neural networks (RNN).
3. LSTM networks--a variation of the RNN network.
In our example, we didn't do any data preprocessing to remove trend and seasonality. There are several research papers available on this topic. Most of them are empirical studies using well-known time series data to analyze the effect of data preprocessing on prediction output. One such empirical study was conducted by Sharda and Patil, who compared predictions from ARIMA and neural networks on 75 time series datasets and found the neural network predictions to be more accurate. You can access the paper at https://link.springer.com/article/10.1007/BF01577272.
We have successfully created a fully connected regression deep learning network to predict the stock price. RNN and LSTM network APIs are currently not available in MXNet R. Hopefully, they will be available soon.
Complete R code
The following is the complete R code:
# Time series data
kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat",skip=3)
king.ts <- ts(kings)
king.ts
install.packages("TTR")
library(TTR)
par(mfrow=c(2, 2))
plot(SMA(king.ts, n=2), main = "n=2")
plot(SMA(king.ts, n=5), main = "n=5")
plot(SMA(king.ts, n=10), main = "n=10")
plot(SMA(king.ts, n=15), main = "n=15")
par(mfrow=c(1,1))
library(zoo)
library(quantmod)
x1 <- Lag(data,1)
new.data <- na.omit(data.frame(Lag.1 = x1, y = data))
head(new.data)
plot(king.ts)
# Introducing MXNet library
library(mxnet)
# Context
mx.ctx.default()
# Operations
ones.matrix <- mx.nd.ones(c(3, 3))  # assumed definition; the original line is on a page not shown here
mx.nd.dot(ones.matrix, ones.matrix)
mx.nd.elemwise.add(ones.matrix,ones.matrix)
# Sampling
mu <- mx.nd.array(c(0.0,2.5))
sigma <- mx.nd.array(c(1,3))
mx.nd.sample.normal(mu = mu, sigma = sigma)
# Symbolic Programming
a <- mx.symbol.Variable("a")
a
b <- mx.symbol.Variable("b")
b
c <- a + b
c
mx.exec.update.arg.arrays(pexec, input_list)
mx.exec.forward(pexec)
pexec$arg.arrays
pexec$outputs
d <- mx.symbol.dot(a, b)
mx.exec.update.arg.arrays(pexec, input_list)
mx.exec.forward(pexec)
pexec$arg.arrays
pexec$outputs
X <- data.matrix(mx.nd.concat(list(x.1,x.2,x.3,x.4)) )
Y <- c(y.1,y.2,y.3,y.4)
# Input layer
data <- mx.symbol.Variable("data")
# Hidden Layer
hidden.layer <- mx.symbol.FullyConnected(data = data
, num_hidden = 2)
# Relu activation tied to the hidden layer
act <- mx.symbol.Activation(data = hidden.layer, act_type = "relu")
# Output Layer
out.layer <- mx.symbol.FullyConnected(data = act
, num_hidden = 2)
# Softmax of output
out <- mx.symbol.SoftmaxOutput(out.layer)
# Perform Predictions
X_test = data.matrix(rbind(c(0,0),c(1,1),c(1,0),c(0,1) ) )
preds = predict(model, X_test, array.layout = "rowmajor")
pred.label <- max.col(t(preds)) -1
pred.label
########################################
# Feature generation
data.ts <- as.ts(data)
data.zoo <- as.zoo(data.ts)
set.seed(100)
# Train/test split
train.perc = 0.8
train.indx = 1:as.integer(dim(final.data)[1] * train.perc)
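# Assumed completion (these assignment lines are on a page not shown in this
# listing): build the train/test frames and separate the predictors from the
# target Y, which is the last column of final.data
train.data <- final.data[train.indx, ]
test.data <- final.data[-train.indx, ]
train.x.data <- data.matrix(train.data[, -which(names(train.data) == "Y")])
train.y.data <- train.data[, "Y"]
test.x.data <- data.matrix(test.data[, -which(names(test.data) == "Y")])
test.y.data <- test.data[, "Y"]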
mx.set.seed(100)
deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
hidden_node = c(1000,500,250)
,out_node = 1
,dropout = 0.50
,activation = c("relu", "relu","relu")
,out_activation = "rmse"
, array.layout = "rowmajor"
, learning.rate = 0.01
, array.batch.size = 100
, num.round = 100
, verbose = TRUE
, optimizer = "adam"
, eval.metric = mx.metric.mae
)
print("Train Error")
model.evaluate(deep.model,train.x.data, train.y.data)
print("Test Error")
model.evaluate(deep.model,test.x.data, test.y.data)
graph.viz(deep.model$symbol)
library(Matrix)
# Look at the graph plot for the name of each layer
# alternatively call deep.model$arg.params$ to see the name
weights <- deep.model$arg.params$fullyconnected85_weight
dim(weights)
image(as.matrix(weights))
random.search <- function(){
# Sample layers
count.layers <- sample(2:5, 1)
no.layers <- c()
activations <- c()
for (i in 1:(count.layers - 1)){
# Sample nodes per layer
no.nodes <- sample(10:50,1)
no.layers[i] <- no.nodes
activations[i] <- "relu"
}
no.layers <- append(no.layers, 1)
activations <- append(activations, "relu")
deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
hidden_node = no.layers
, out_node = 1
, dropout = 0.50
, activation = activations
,out_activation = "rmse"
, array.layout = "rowmajor"
, learning.rate = 0.01
, array.batch.size = 100
, num.round = 10
, verbose = TRUE
, optimizer = "adam"
, eval.metric = mx.metric.mae
)
train.error <- model.evaluate(deep.model,train.x.data, train.y.data)
test.error <- model.evaluate(deep.model,test.x.data, test.y.data)
output <- list(layers = no.layers, activations <- activations
, train.error = train.error, test.error = test.error)
return(output)
}
final.output = list()
for (i in 1:2){
out <- random.search()
final.output[[i]] <- out
}
Summary
We started the chapter by introducing time series data and the traditional approaches to
solving them. We gave you an overview of deep learning networks and information on how
they learn. Furthermore, we introduced the MXNet R package. Then we prepared our stock
market data so that our deep learning network could consume it. Finally, we built two deep
learning networks, one for regression, where we predicted the actual closing price of the
stock, and one for classification, where we predicted whether the stock price would move
up or down.
In the next chapter, we will deal with sentiment mining. We will show how to extract tweets in R, process them, and use a dictionary-based method to find the sentiment of the tweets. Finally, using those scored tweets as our dataset, we will build a Naive Bayes model based on kernel density estimates.
5
Twitter Text Sentiment Classification Using Kernel Density Estimates
Twitter and other social media applications are generating a lot of unstructured text today
at a great velocity. Today's product companies look at these social media applications as a
great source to give them an understanding of how consumers feel about their products or
services. Hence, the analysis of these unstructured texts has become very important for
these companies.
Text mining as a field is gaining importance day by day. The invention of techniques such as bag-of-words and word2vec embeddings has made text data suitable for processing by most machine learning algorithms. Analyzing the sentiment in a text is an active field of research.
Sentiment analysis, also called opinion mining, is the field of study that
analyzes people's opinions, sentiments, evaluations, appraisals, attitudes,
and emotions towards entities such as products, services, organizations,
individuals, issues, events, topics, and their attributes. Liu, Bing. Sentiment
Analysis and Opinion Mining (Kindle Locations 199-201). Morgan and
Claypool Publishers.
We will look at how to pre-process text so it can be efficiently used by machine learning
algorithms. We will introduce a recently published state of the art word weighting scheme
called Delta TFIDF. Delta TFIDF is a variation of the traditional tfidf weighting scheme,
specially designed for sentiment classification tasks. Sentiment classification can be done
using either a dictionary method or a machine learning approach. We will show you
examples of both.
Given a text, the model should predict the sentiment of the text as either positive or
negative.
Any data we analyze is generated from a process. The underlying process emits the data
following a probability distribution. Kernel density estimate techniques help find the
underlying probability distribution. It helps find the probability density function for the
given sample of data. Once we have an idea about the distribution of the data, we can
leverage this understanding of the data to make decisions further down the line based on
the application.
In this chapter, our application is to classify a given text as either positive sentiment
oriented or negative sentiment oriented. In our training set, we will have both positively
oriented and negatively oriented sentences. We will be using kernel density
estimation (KDE) for this classification task. Using KDE, we can find the distribution for
positively oriented text and negatively oriented text. The assumption is that these two
distributions should be different.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example we will introduce the R packages twitteR, sentimentr,
tidyr, tm, and naivebayes, which we will be using. During our code description, we will
be using some of the output printed in the console. We have included what will be printed
in the console immediately following the statement which prints the information to the
console, so as to not disturb the flow of the code.
We can generate histograms using the following code block for reference:
> data <- rnorm(1000, mean=25, sd=5)
> data.1 <- rnorm(1000, mean=10, sd=2)
> data <- c(data, data.1)
> hist(data)
> hist(data, plot = FALSE)
$breaks
[1] 0 5 10 15 20 25 30 35 40 45
$counts
[1] 8 489 531 130 361 324 134 22 1
$density
[1] 0.0008 0.0489 0.0531 0.0130 0.0361 0.0324 0.0134 0.0022 0.0001
$mids
[1] 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5
$xname
[1] "data"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
This code creates two artificial datasets and combines them. Both datasets are based on the normal distribution; the first has a mean of 25 and a standard deviation of 5, the second a mean of 10 and a standard deviation of 2. If we recall from basic statistics, a normal distribution is bell shaped and centered at the mean value, and about 99.7% of the values lie between mean-(3*sd) and mean+(3*sd). By combining the data from these two distributions, the new dataset has two peaks and a small area of overlap.
The output in R shows the bin sizes and locations; look at $breaks and $mids in the R output.
We provide the hist function with the breaks parameter, which is a sequence from 0 to 50 with a step size of two. We want our bin size to be two rather than the default bin width that R selected.
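The call being described is not shown in the text; it is simply (a sketch):
hist(data, breaks = seq(0, 50, by = 2))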
As you can see, by varying the bin size, the representation of the data changes. You can test the same effect by varying the bin locations, or $mids. The histogram depends on having the right bin size and bin location.
The alignment of the bins with the data points decides the height of each bar in the histogram. Ideally, this height should be based on the actual density of the nearby points. Can we stack the bars in such a way that their height is proportional to the points they represent? This is where KDE one-ups the histogram.
A discrete random variable takes on a finite number of possible values, and we can determine the probability of each possible value using a probability mass function. A continuous random variable takes on an infinite number of possible values; we cannot find the probability of a particular value, but we can find the probability that a value falls in some interval. A probability density function (PDF) is used to find the probability of a value falling in an interval for a continuous random variable.
[1]0.9544997
pnorm is the function that gives the cumulative distribution function of the normal distribution, that is, the probability that a value is less than or equal to a given point. Here we calculate the probability that a value falls within 2 standard deviations of the mean, which we know from basic statistics is about 95%. We can calculate
this probability by looking at the probability that the given value is greater than the
mean+(2*sd); we then multiply that by 2 (to account for the values less than mean-(2*sd)).
This gives us the probability the values are in the "tails" of the distribution, so we subtract
this value from 1 to get the probability that the values lie between mean +/- (2*sd) to give
the answer of 0.9544997. Note that even if you change the values of mean and sd, the result
is always 0.9544997.
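The calculation described in this paragraph can be written as follows (a sketch; as noted, the particular mean and sd are irrelevant):
> m <- 25; s <- 5
> 1 - 2 * pnorm(m + 2 * s, mean = m, sd = s, lower.tail = FALSE)
[1] 0.9544997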
Let us see a univariate example using the density function to approximate the PDF of the
given data:
# Kernel Density estimation
kde = density(data)
plot(kde)
This smoothed out plot from KDE gives us a better idea about the distribution of the
underlying data. The density function in R uses a Gaussian kernel by default for each
point in the data. The parameters of KDE are the kernel, which is used at every point to
decide the shape of the distribution at each point, and the bandwidth, which defines the
smoothness of the output distribution.
Why use a kernel instead of simply counting the points near each location and summing them up? A naive density estimate at a point x0 counts the observations within a distance h of x0:
f(x0) = #{ xi : |xi - x0| < h } / (2nh)
where h is the distance threshold. This may give a bumpy plot, as each point xi in the neighborhood defined by h receives the same weight. The kernel density estimate replaces this hard count with a weighted sum:
f(x0) = (1 / (n * h)) * sum over i of K((x0 - xi) / h)
where K is the kernel and h is the smoothing parameter, or bandwidth. The kernel gives weight to each point xi based on its proximity to x0.
Look at the R help function to learn more about the density function:
help(density)
Hopefully, this section has given you an understanding of how to use kernel density estimation to approximate the probability density function of the underlying dataset. We will be using this PDF to build our sentiment classifier.
Refer to the Kernel Smoothing Methods chapter (Chapter 6) of The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, for a rigorous discussion of kernel density estimates.
Twitter text
We will leverage the twitteR package to extract tweets. Refer to https://cran.r-project.
org/web/packages/twitteR/index.html to get more information about this package.
In order to use this package, you need a Twitter account. With the account, sign in to
https://app.twitter.com and create an application. Use the consumer key, consumer
secret key, access token, and access token security keys from that page to authenticate into
Twitter.
We will retrieve only the English tweets using the searchTwitter function provided by
the twitteR package:
tweet.results <- searchTwitter("@apple", n=1000,lang = "en")
tweet.df <- twListToDF(tweet.results)
Using the function twListToDF, we convert our extracted tweets from Twitter to a data frame, tweet.df. Out of all the fields extracted, we are most interested in the text column.
We have removed all the retweets from our results and retain just the text column.
Our tweet.df data frame now only has a single text column.
We have removed the URLs, spaces, Unicode characters, and so on from our tweets.
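A sketch of this clean-up, assuming the isRetweet flag returned by twListToDF and the stringr package for the string handling (the book's exact code is on a page not shown here):
library(stringr)
tweet.df <- tweet.df[!tweet.df$isRetweet, "text", drop = FALSE]        # drop retweets, keep text
tweet.df$text <- str_replace_all(tweet.df$text, "https?://\\S+", "")   # remove URLs
tweet.df$text <- iconv(tweet.df$text, "UTF-8", "ASCII", sub = "")      # remove Unicode characters
tweet.df$text <- str_squish(tweet.df$text)                             # collapse extra whitespace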
Having extracted and cleaned up this Twitter text, let us proceed to build our text
classification.
Sentiment classification
Sentiment in text can be either positive or negative. We will stick to this definition for this
chapter. Sentiment mining is an active field of research. A starting point for someone to
learn the various aspects of sentiment mining is through the book Sentiment Analysis and
Opinion Mining, by Bing Liu.
Broadly speaking, sentiment mining problems are solved using two families of techniques: dictionary methods and machine learning methods.
Dictionary methods
Sentiment lexicons, in which words are categorized as positive or negative, are used in this technique. Several such lexicons are available today.
We match these lexicons to the tokens (words) in the input data and we can then calculate
an overall sentiment, for example, based on the proportion of positive words to negative
words.
The tidytext package provides three sentiment lexicons. In the book Text Mining with R: A Tidy Approach, there is a full chapter on sentiment mining using lexicons: http://tidytextmining.com/sentiment.html.
Our approach
In this chapter, we are going to leverage dictionary and machine learning methods.
After extracting our text, we are going to use the dictionary-based approach to first give
sentiment scores to the extracted tweets. Later, we will manually verify if the scores are
good. This is our gold set of data for building a classifier. We divide this data into two sets,
one we will use for training our classifier, and the other we will use to test our classifier.
Since we don't have any pre-labeled tweet data, we are following this approach.
Let us now proceed to build our gold dataset using a sentiment dictionary.
Let us see how to score using the sentiment function from the sentimentr package:
> library(sentimentr, quietly = TRUE)
> sentiment.score <- sentiment(tweet.df$text)
> head(sentiment.score)
element_id sentence_id word_count sentiment
1: 1 1 8 0.0000000
2: 2 1 8 0.3535534
3: 3 1 3 0.0000000
4: 3 2 4 0.0000000
5: 3 3 7 0.0000000
6: 4 1 14 -0.8418729
The sentiment function in sentimentr calculates a score between -1 and 1 for each of the
tweets. In fact, if a tweet has multiple sentences, it will calculate the score for each sentence.
A score of -1 indicates that the sentence has very negative polarity. A score of 1 means that
the sentence is very positive. A score of 0 refers to the neutral nature of the sentence.
Refer to https://cran.r-project.org/web/packages/sentimentr/index.
html for more details about the sentimentr package.
However, we need the score to be at the tweet level and not at the sentence level. We can take the average value of the score across all the sentences in a tweet.
Here, the element_id refers to the individual tweet. By grouping by element_id and
calculating the average, we can get the sentiment score at the tweet level.
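A sketch of that aggregation using dplyr (which the book loads elsewhere; the exact code is on a page not shown here):
library(dplyr)
sentiment.score <- sentiment.score %>%
  group_by(element_id) %>%
  summarise(sentiment = mean(sentiment))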
Let us now add the sentiment to our original tweet.df data frame. Going forward, we only need the text and its sentiment; let us subset those columns from tweet.df:
> tweet.df$polarity <- sentiment.score$sentiment
> tweet.final <- tweet.df[,c('text','polarity')]
We have our dataset prepared now, with the tweet text and the sentiment score.
We remove all records with a polarity value of 0. These are records with neutral sentiments.
If our polarity is less than zero, we mark the tweet as negative, otherwise, it is positive:
> tweet.final <- tweet.final[tweet.final$polarity != 0, ]
> tweet.final$sentiment <- ifelse(tweet.final$polarity < 0,
"Negative","Positive")
> tweet.final$sentiment <- as.factor(tweet.final$sentiment)
> table(tweet.final$sentiment)
Negative Positive
200 168
Using ifelse, we have discretized the real-valued sentiment score into a binary variable.
We added an identifier column. With that, we have the training data ready. Finally, the
table command shows us the class distribution. Our class distribution is imbalanced.
Let us say we are building a discriminative model such as logistic regression or an SVM, where the model is trying to learn a boundary between the classes. Such a model expects roughly equal numbers of positive and negative examples. If the number of positive examples is much greater or much smaller than the number of negative examples in the dataset, we have a class imbalance problem.
There are several techniques to balance the dataset. Some of them are downsampling,
upsampling, SMOTE, and so on. We will leverage the function upSample in the caret
package to create more records in the minority class. This should produce better results in
the classification models.
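A sketch of the call whose output is shown below (the yname argument and the exact form of the x argument are assumptions):
library(caret)
up.sampled <- upSample(x = tweet.final["text"], y = tweet.final$sentiment,
                       yname = "sentiment")
table(up.sampled$sentiment)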
Negative Positive
200 200
The upSample function looks at the class distribution and repeatedly samples from the
Positive class to balance. The final table command shows the new class distribution.
In this chapter, we are going to learn the distributions of the positive and negative classes independently and use them for our predictions. Hence, we don't need to balance the classes.
Before we move on to the next section, let us spend some time understanding the inner
workings of our dictionary-based sentiment function. The sentiment function utilizes a
sentiment lexicon (Jockers, 2017) from the lexicon package. It preprocesses the given text
as follows:
Each word is looked up in the lexicon, and positive and negative words are tagged with +1 and -1 respectively. Let us call the words which have received a score the polarized words. Not all words receive a score; only those found in the lexicon do. We can pass a custom lexicon through the polarity_dt parameter of the sentiment function.
For each of the polarized words, n words before them and n words after them are
considered, and together they are called polarized context clusters. The parameter n can be
set by the user. The words in the polarized context cluster can be tagged as either of the
following:
Neutral
Negator
Amplifier
De-amplifier
Adversative conjunctions
Each polarized word is now weighted based on polarity_dt, and also based on the number of surrounding valence shifters tagged as amplifiers or adversative conjunctions. Neutrally tagged words carry no weight.
For more details about the weight and scoring, refer to R help for the sentiment function.
Text pre-processing
Before we build our model, we need to prepare our data so it can be provided to our model.
We want a feature vector and a class label. In our case, the class label can take two values,
positive or negative depending on if the sentence has a positive or a negative sentiment.
Words are our features. We will use the bag-of-words model to represent our text as features. In a bag-of-words model, the following steps are performed to transform a text into a feature vector:
1. Extract all unique individual words from the text dataset. We call a text
dataset a corpus.
2. Process the words. Processing typically involves removing numbers and other
characters, placing the words in lowercase, stemming the words, and removing
unnecessary white spaces.
3. Each word is assigned a unique number, and together they form the vocabulary. A word unknown is added to the vocabulary; this accounts for the unknown words we will see in future datasets.
4. Finally, a document term matrix is created. The rows of this matrix are the
document IDs, and the columns are formed by the words from the vocabulary.
We need to pre-process our tweets in a similar manner. We will use the tm R package to
pre-process our twitter text.
The get.dtm function creates a document term matrix from the given data frame. It does
text pre-processing/normalization before creating the actual document term matrix.
Preprocessing involves removing any punctuation, removing numbers, stripping unwanted
white space, removing English stop words, and finally converts the text into lowercase.
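The body of get.dtm is on a page not shown here; the following is a sketch based on the description above, written against the current tm API (the DataframeSource column names and the use of get() to look up the weighting function by name are assumptions):
library(tm)
get.dtm <- function(text.col, id.col, input.df, weighting){
  # tm's DataframeSource expects columns named doc_id and text
  df <- data.frame(doc_id = input.df[[id.col]],
                   text = input.df[[text.col]],
                   stringsAsFactors = FALSE)
  corpus <- VCorpus(DataframeSource(df))
  # Normalization: lowercase, strip punctuation, numbers, stop words, whitespace
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  # Document term matrix with the requested weighting scheme, e.g. "weightBin"
  DocumentTermMatrix(corpus, control = list(weighting = get(weighting)))
}
stopwords("english")   # the (truncated) stop word list referred to below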
The output is truncated, as we have not shown all the stop words. In text mining, stop
words are considered those which do not contribute much to the context of the text. Let's
say we have two documents--if we want to differentiate them, we cannot use stop words to
find what those two documents are uniquely representing. Hence, we typically remove
these words from our text as a pre-processing exercise.
We have around 62 documents and 246 words. Document term matrices are typically
sparse, as all words don't appear in all the documents. This brings us to the weighting
scheme. When we explained the document term matrix, we used a binary weighting scheme: a one was added to the cell if a word was present in a document, and a zero if it was not. We can use a different weighting scheme called term frequency-inverse document frequency.
The document frequency or df for a given word w is the number of documents in which the
word has occurred. Document frequency is considered the global weight.
idf(w) = log(number of documents in the corpus / (1 + document frequency of the word))
Term-frequency inverse document frequency or TFIDF is the product of tf and idf. They
define the importance of a word in our corpus. According to this scheme, the most
frequently occurring words are not given high importance since they don't carry enough
information to differentiate one tweet from another.
You can see the TFIDF value for the word fire in document 2 is 0.27.
TFIDF is a great weighting scheme. It has been successfully used in many text mining
projects and information retrieval projects. We are going to use a modified version of
TFIDF called Delta TFIDF for our feature generation.
Delta TFIDF
The problem with TFIDF is it fails to differentiate between words from the perspective of
implicit sentiments. During the calculation of TFIDF, no knowledge of the document
sentiment is added. Hence, it may not serve as a good differentiating feature for sentiment
classification.
Delta TFIDF was proposed by Justin Martineau and Tim Finin in their paper Delta TFIDF:
An Improved Feature Space for Sentiment
Analysis: http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf.
Delta TFIDF is calculated for each word and document combination as follows:
Find Ctd, the number of times the word occurs in the document (its term frequency).
Find nt, the number of negative documents in which the word has occurred.
Find pt, the number of positive documents in which the word has occurred.
The Delta TFIDF score for that word and document is then Ctd * log(nt / pt), as implemented in the code that follows.
We get the document term matrix for our whole corpus using the get.dtm helper, with term frequency as the weighting scheme (see the sketch below). The rows are our documents and the columns are our vocabulary. Our dtm is very sparse; there are a lot of cells with zero values. Let us throw away some terms and try to reduce the sparsity of our document term matrix.
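A sketch of that step (the "weightTf" weighting string is an assumption based on the description):
dtm <- get.dtm('text', 'id', tweet.final, "weightTf")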
> dtm <- removeSparseTerms(dtm, 0.98)
> dtm
<<DocumentTermMatrix (documents: 368, terms: 58)>>
Non-/sparse entries: 934/20410
Sparsity : 96%
Maximal term length: 11
After reducing the sparseness, we convert our document term matrix to a matrix object.
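A sketch of the conversion (an assumption): one dense copy of the term frequencies, and a second copy that will later hold the Delta TFIDF scores:
dtm.mat <- as.matrix(dtm)
c.dtm.mat <- dtm.mat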
Now let us split our data into a positive tweets dataset and a negative tweets dataset and
get their respective document term matrices:
dtm.pos <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
'Positive',],"weightBin")
dtm.neg <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
'Negative',],"weightBin")
dtm.pos.mat and dtm.neg.mat are the matrix representations of the positive and the negative tweet corpuses. pos.words.df contains the words and their document frequencies in the positive corpus; similarly, neg.words.df contains the words and their document frequencies in the negative corpus.
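A sketch of how these lookups may be built (an assumption; with binary weighting, the column sums count the documents containing each word):
dtm.pos.mat <- as.matrix(dtm.pos)
dtm.neg.mat <- as.matrix(dtm.neg)
pos.words.df <- colSums(dtm.pos.mat)
neg.words.df <- colSums(dtm.neg.mat)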
Let us get all the unique words and the document IDs:
> tot.features <- colnames(dtm.mat)
> doc.ids <- rownames(dtm.mat)
for( i in 1:length(tot.features)){
for ( j in 1:length(doc.ids)){
# Number of times the term has occurred in the document
ctd <- dtm.mat[doc.ids[j], tot.features[i]]
# Number of documents in the positive data with the term
pt <- pos.words.df[tot.features[i]]
# Number of documents in the negative data with the term
nt <- neg.words.df[tot.features[i]]
score <- ctd * log( nt / pt)
if(is.na(score)){
score <- 0
}
c.dtm.mat[doc.ids[j], tot.features[i]] <- score
}
}
Our dtm.mat has the term frequency for each word and document. We use pos.words.df
and neg.words.df to find the document frequency of a word in both positive and negative
corpuses. Finally, we update c.dtm.mat with the new score.
We have calculated the Delta TFIDF. This brings us to the end of this section. We have
prepared our data to be consumed by the model. Let us proceed to build our sentiment
classification model.
We have a set of tweets positively labeled. Another set of tweets negatively labeled. The
idea is to learn the PDF of these two data sets independently using kernel density
estimation.
By Bayes' rule, P(label | x) = P(x | label) * P(label) / P(x). Here, P(x | label) is the likelihood, P(label) is the prior, and P(x) is the evidence; the label can be positive sentiment or negative sentiment.
Using the PDF learned from kernel density estimation, we can easily calculate the likelihood, P(x | label)
For any new tweet, we can now calculate, using Bayes' rule, P(Label = Positive | words and their Delta TFIDF weights).
Voila, we have our sentiment classifier assembled based on kernel density estimation.
Before we jump into our use case, let us see quickly how KDE can be leveraged to classify
the Iris data set. More information about this data set is available in https://archive.ics.
uci.edu/ml/datasets/iris.
We pass the first four columns as our features, x, and the last column, Species, as our y variable. There are three classes in this dataset and a total of 150 records, 50 per class.
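A sketch of that example using the naivebayes package, which is introduced formally below:
library(naivebayes)
data(iris)
iris.model <- naive_bayes(x = iris[, 1:4], y = iris$Species, usekernel = TRUE)
plot(iris.model)   # one density plot per feature, split by class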
We have a plot for every column in the iris dataset. Let us look at the Petal.Width column. There are three density plots, one per class, representing P(Petal.Width | setosa), P(Petal.Width | versicolor), and P(Petal.Width | virginica). Each of them shows the underlying distribution of Petal.Width for one of the classes.
Other plots can be interpreted similarly. Using these PDFs, we can now classify the data into one of the three classes. Hopefully, this gives an idea of how the underlying distributions discovered by KDE can be used to separate the records into their respective classes.
We use the naivebayes R package. The function naive_bayes is used to build the model. You can see that we have set the usekernel parameter to TRUE. This informs the function to use KDE for calculating the likelihood.
This will help you view the various properties of this model.
Having built the model, we can use the standard predict function to predict the label for
unknown tweets.
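A sketch of the model-building and prediction calls described here (the exact code is on a page not shown here; scoring the training tweets themselves mirrors the confusion matrix that follows):
library(naivebayes)
nb.model <- naive_bayes(x = as.data.frame(c.dtm.mat),
                        y = tweet.final$sentiment,
                        usekernel = TRUE)
preds <- predict(nb.model, as.data.frame(c.dtm.mat))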
library(caret)
confusionMatrix(preds, tweet.final$sentiment)
Reference
Prediction Negative Positive
Negative 128 58
Positive 72 110
Accuracy : 0.6467
95% CI : (0.5955, 0.6956)
No Information Rate : 0.5435
P-Value [Acc > NIR] : 3.744e-05
Kappa : 0.2928
Mcnemar's Test P-Value : 0.2542
Sensitivity : 0.6400
Specificity : 0.6548
We have an accuracy of 64%. Can we understand why we reached this number? A good way to investigate is to look at the features fed to the model; in this case, the features are the words. Looking at the PDFs of the words for both positive and negative sentiment should throw some light on the model's performance.
Let us look at some of the variables and their PDF for positive and negative sentiments:
The graph represents a subset of the words used for classification. It is evident that the underlying PDFs estimated by kernel density estimation are not very different for the positive and the negative classes. The positive and negative PDFs are represented by the red and green lines, and in most cases the two overlap each other. Compare these plots to the plots we generated for the Iris dataset, where the PDFs were distinctly separate. Hence, the model does not have much classification power.
That brings us to the end of the section. We have shown you a simple Naive Bayes classifier
using KDE estimates for calculating the likelihood. You can go ahead and use other
classification methods and try to compare the results. Another good exercise would be to
compare the normal TFIDF features to the Delta TFIDF features.
Let us first load all the necessary libraries and authenticate into our Twitter account:
library(shiny)
library(twitteR, quietly = TRUE)
library(stringr)
library(sentimentr, quietly = TRUE)
library(dplyr, quietly = TRUE)
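The authentication step itself is a single call (a sketch; the four key variables hold your own credentials, as in the Complete R code section):
setup_twitter_oauth(consumer.key, consumer.secret, access.token, token.secret)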
We have a navigation panel at the top. The first navigation panel has a textbox in which we
can enter our search string. A default value is provided, so when you open the application,
you will have the search results available for that string. We have a data table output
following the textbox. This is where we display the tweet results.
In the second navigation panel, we have the data table output, where we want to show the
tweets and their sentiments.
tweet.df is declared as a reactive expression; it will change whenever the text in our search box changes. Here, we search Twitter for the search string provided, capture the returned tweets, clean them up, calculate their polarity scores, and keep them ready.
It's simple from now on. output$results returns a data frame; it takes the data frame from our tweet.df reactive expression and returns only the text column. output$sentiment behaves the same way as output$results, except that it returns the whole data frame. A sketch of the full server function follows the note below.
Reactive expressions are expressions that can read reactive values and call
other reactive expressions. Whenever reactive value changes, any reactive
expressions that depended on it are marked as invalidated and will
automatically re-execute if necessary. - RShiny web page.
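Putting these pieces together, the server side might look like the following sketch (the cleaning and scoring details are assumptions based on the description above):
server <- function(input, output) {
  tweet.df <- reactive({
    tweets <- searchTwitter(input$search, n = 100, lang = "en")
    df <- twListToDF(tweets)
    df <- df[!df$isRetweet, "text", drop = FALSE]
    df$polarity <- sentiment_by(df$text)$ave_sentiment
    df$sentiment <- ifelse(df$polarity < 0, "Negative", "Positive")
    df
  })
  output$results <- renderDataTable({ tweet.df()["text"] })
  output$sentiment <- renderDataTable({ tweet.df() })
}
shinyApp(ui = ui, server = server)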
We can change the search string and have the new results shown:
You can see that we have changed the query to R data science.
In the preceding screenshot, we can see the sentiment for the tweets.
Complete R code
The following is the complete R code used for this project:
# Generate data
data <- rnorm(1000, mean=25, sd=5)
data.1 <- rnorm(1000, mean=10, sd=2)
data <- c(data, data.1)
# Plot histogram
hist(data)
# View the bins
hist(data, plot = FALSE)
library(pdfCluster)
kde.1 <- kepdf(data)
plot(kde.1)
kde.1@kernel
kde.1@par
# Twitter Text
library(twitteR, quietly = TRUE)
# KDE Classifier
library(tm)
get.dtm <- function(text.col, id.col, input.df, weighting){
stopwords("english")
# delta tf-idf
dtm.pos <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
'Positive',],"weightBin")
dtm.neg <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
'Negative',],"weightBin")
for( i in 1:length(tot.features)){
for ( j in 1:length(doc.ids)){
# Number of times the term has occurred in the document
ctd <- dtm.mat[doc.ids[j], tot.features[i]]
# Number of documents in the positive data with the term
pt <- pos.words.df[tot.features[i]]
# Number of documents in the negative data with the term
nt <- neg.words.df[tot.features[i]]
score <- ctd * log( nt / pt)
if(is.na(score)){
score <- 0
}
dtm.mat[doc.ids[j], tot.features[i]] <- score
}
}
library(naivebayes)
consumer.key <-
consumer.secret <-
access.token <-
token.secret <-
ui <- fluidPage(
navbarPage("TweetSenti",
tabPanel("Search Tweets"
, textInput("search","search",value="#bladerunner")
, dataTableOutput("results")
),
tabPanel("Tag Sentiment"
,dataTableOutput("sentiment"))
)
)
Summary
We started this chapter with a discussion about the KDE and its usefulness in
understanding the underlying distribution of data. We proceeded by explaining how to
extract tweets from Twitter for a given search string in R. Then, we proceeded to explain the
sentiment mining and its dictionary and machine learning approaches. Using a dictionary approach, we calculated the sentiment scores for the tweets. We further explained the text pre-processing routines required to prepare the text data. We covered weighting schemes for creating document term matrices. We discussed the classic TFIDF and the new Delta
TFIDF schemes. We created our training set using the Delta TFIDF scheme. Using this
training set, we finally built a Naive Bayes KDE classifier to classify tweets based on the
sentiment the text carried.
In the next chapter, we will be working on record linkage, a master data management technique used for data deduplication.
6
Record Linkage - Stochastic and Machine Learning Approaches
In a large database of records, synonymous records pose a great problem. Two records
referring to the same entity are considered to be synonymous. In the absence of a common
identifier, such as a primary key or foreign key, joining such records based on the entities is
a tough task. Let's illustrate this with a quick example. Consider the following two records:
Sno First name Middle name Last name Address City State Zip
1 John NULL NULL 312 Delray Ave Deer Field FL 33433
2 John NULL Sanders 312 Delray Beach Ave Deer Field FL 33433
Both records refer to the same entity, one Mr. John. Record linkage refers to an umbrella of algorithms designed to solve exactly this problem. Record linkage plays a key role today in various applications, such as CRM and loyalty systems, to name a few. These algorithms are an integral part of today's sophisticated business intelligence systems and master data management systems.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the RecordLinkage R package that we will be using. During our code description, we will be using some of the output printed in the
console. We have included what will be printed in the console immediately following the
statement which prints the information to the console, so as to not disturb the flow of the
code.
The preceding image is a snapshot of the data from our customer and Solit. Assume that we have merged the database records from both parties. The record number indicates the source of the record, that is, whether it came from our customer or from Solit. If we can find duplicates in this dataset, that should serve as the first step in identifying the customers channeled by Solit.
Our data, RLdata500, has 500 records and 7 variables, which include first name, last name, and date of birth details. The first and last names are separated into two components, denoted by the suffixes _c1 and _c2. The date of birth is split into year, month, and day. Let's look at the steps that we are going to follow to implement record linkage.
Feature generation is the first step in record linkage. Once we have the desired features, we can solve the record linkage problem either using a stochastic approach or by fitting a machine learning model to the generated features.
Feature generation
We will use compare.dedup to generate the features:
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> summary(rec.pairs)
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Our constraints are represented in a list. We say we need either to match the first column or
columns 5 up to 7 for two records to qualify to become a pair. You can see in the results that
we are finally left with only 1,221 pairs:
Deduplication Data Set
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
The entry pairs in the list form a data frame that has all the generated features. We capture
this data frame under the name matches:
matches <- rec.pairs$pairs
Let's look at the first two rows of this data frame. Each row compares two records:
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1 1 174 1 NA 0.1428571 NA 0 0 0 NA
2 1 204 1 NA 0.0000000 NA 0 0 0 NA
The first instance compares records 1 and 174. There is a perfect match in the first component of the first name, and neither entity has a second component for the first name. We see a float number in the first component of the last name; this number is the output of a string comparison. There is no match in the date of birth fields. The final column, is_match, indicates whether we have a match; we will get to it later. Let's start with the string comparison.
String features
Let's look at the dedup function invocation once again:
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
The strcmp and strcmpfun parameters dictate on which fields we need to do string
comparison and what kind of string comparison we need to apply. We pass a vector
indicating the column IDs to strcmp. We need to do string comparisons in columns 2, 3,
and 4. We want to use the Levenshtein distance to find the similarity between two strings.
Let's look at records 1 and 174; we see the first name matching, but no match with the rest
of the fields. The Levenshtein similarity of 0.1428571 for the last names also tells us how far
apart they are from each other:
> RLdata500[1,]
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd
1 CARSTEN <NA> MEIER <NA> 1949 7 22
> RLdata500[174,]
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd
174 CARSTEN <NA> SCHMITT <NA> 2001 6 27
We need to state the exact columns where string comparisons should be applied. Not
specifying this may lead to unexpected results:
> rec.pairs.matches <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = TRUE
+ ,strcmpfun = levenshteinSim)
> head(rec.pairs.matches$pairs)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1 1 174 1 NA 0.1428571 NA 0.00 0.5 0.5 NA
2 1 204 1 NA 0.0000000 NA 0.50 0.5 0.0 NA
3 2 7 1 NA 0.3750000 NA 0.75 0.5 0.0 NA
4 2 43 1 NA 0.8333333 NA 1.00 1.0 1.0 NA
5 2 169 1 NA 0.0000000 NA 0.50 0.0 0.5 NA
6 4 19 1 NA 0.1428571 NA 0.50 0.5 0.0 NA
You can see that the function has also calculated the string comparisons for the date of birth
fields!
We had excluded the first column from string comparison. It was used as
a blocking field. However, we can use string comparison for the first
column. In that case, the string comparison is performed before the
blocking.
Phonetic features
The RecordLinkage package includes Soundex and Pho_h algorithms to compare string
columns. In our example, we want to use columns 2, 3, and 4 for string comparison,
specified by the list we pass to the phonetic parameter, and use the Pho_h function by
passing it to the phonfun parameter:
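The call below mirrors the complete code listing at the end of this chapter; treat it as a sketch of the invocation rather than a reproduction of the book's exact console session:

> rec.pairs.matches <- compare.dedup(RLdata500
+                      ,blockfld = list(1, 5:7)
+                      ,phonetic = c(2,3,4)
+                      ,phonfun = pho_h)
> head(rec.pairs.matches$pairs)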
If we compare the results with the string matching output, we see that we find no match
between record IDs 1 and 174. Let's look at instance 4, where records id1, 2 and id2, 43
are compared:
> RLdata500[2,]
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd
2 GERD <NA> BAUER <NA> 1968 7 27
> RLdata500[43,]
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd
43 GERD <NA> BAUERH <NA> 1968 7 27
The last names in this case sound similar; hence, the algorithm has captured them as
similar records.
We will show how to leverage two methods, emWeights and epiWeights, implemented in
the RecordLinkage package.
P(features | match = 0) and P(features | match = 1) are estimated using the expectation
maximization algorithm. The weights are calculated from the ratio of these two probabilities.
This approach is called the Fellegi-Sunter model.
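Concretely, assuming the standard Fellegi-Sunter formulation (which is consistent with the M, U, and W slots in the output shown later), the weight for a comparison pattern is the base-2 logarithm of the ratio of the two probabilities:

$$W = \log_2 \frac{P(\text{features} \mid \text{match} = 1)}{P(\text{features} \mid \text{match} = 0)} = \log_2 \frac{M}{U}$$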
> library(RecordLinkage)
> data("RLdata500")
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> pairs.weights <- emWeights(rec.pairs)
> hist(pairs.weights$Wdata)
>
As seen in the feature generation section, we use the dedup function to generate string
comparison features. With the features, we invoke the emWeights function to get the
Fellegi-Sunter weights. The output of emWeights is a list:
> str(pairs.weights)
List of 8
$ data :'data.frame': 500 obs. of 7 variables:
..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128
112 77 42 139 26 99 ...
..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA
NA NA NA NA NA NA ...
..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50
23 76 61 77 30 ...
..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA
NA NA NA NA ...
..$ by : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978
1971 ...
..$ bm : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
..$ bd : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
$ pairs :'data.frame': 1221 obs. of 10 variables:
..$ id1 : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
..$ id2 : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ by : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
..$ bm : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
..$ bd : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
$ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1"
"lname_c2" ...
$ type : chr "deduplication"
$ M : num [1:128] 0.000355 0.001427 0.004512 0.01815 0.001504 ...
$ U : num [1:128] 2.84e-04 8.01e-06 2.52e-05 7.10e-07 2.83e-06
...
$ W : num [1:128] 0.322 7.477 7.486 14.641 9.053 ...
$ Wdata : num [1:1221] -10.3 -10.3 -10.3 12.8 -10.3 ...
- attr(*, "class")= chr "RecLinkData"
>
The Wdata vector stores the weights for record linkage based on the EM algorithm; higher
values indicate better matches. Let's plot this data as a histogram to look at the weight
distribution:
The histogram is skewed with a lot of negative weights and very few positive weights. This
gives us a hint that we have very few matches in our dataset. Alternatively, we can view the
weight distribution as follows:
> summary(pairs.weights)
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Weight distribution:
The getPairs function conveniently gives the weights for the pair:
> weights.df<-getPairs(pairs.weights)
> head(weights.df)
id fname_c1 fname_c2 lname_c1 lname_c2 by bm bd Weight
1 48 WERNER <NA> KOERTIG <NA> 1965 11 28
2 238 WERNIER <NA> KOERTIG <NA> 1965 11 28 29.628078
3
4 68 PETEVR <NA> FUCHS <NA> 1972 9 12
5 190 PETER <NA> FUCHS <NA> 1972 9 12 29.628078
6
For record IDs 48 and 238, the weight is 29.62. The higher the weight, the higher the
probability of a match. With the weights, we can now use a threshold-based classification
model. We can derive the thresholds from either the histogram or the weight distribution.
We choose 10 as the upper threshold, that is, for a match, we need a weight of 10 or more.
We set the lower threshold as 5; any entity pair with a weight of less than 5 will be tagged as
a no match, and pairs falling between the two thresholds are flagged as possible matches.
The emClassify function is used to classify the entities as match and no match:
> pairs.classify <- emClassify(pairs.weights, threshold.upper = 10,
threshold.lower = 5)
> summary(pairs.classify)
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Weight distribution:
51 links detected
2 possible links detected
1168 non-links detected
Classification table:
classification
true status N P L
<NA> 1168 2 51
The label N stands for no match, or no links found. Label P stands for possible matches and
label L for matches, aka links found. We see that with our given thresholds, 51 matches were
found. Let's make a single data frame to collate all our results:
> final.results <- pairs.classify$pairs
> final.results$weight <- pairs.classify$Wdata
> final.results$links <- pairs.classify$prediction
> head(final.results)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match    weight links
1   1 174        1       NA 0.1428571       NA  0  0  0       NA -10.28161     N
2   1 204        1       NA 0.0000000       NA  0  0  0       NA -10.28161     N
3   2   7        1       NA 0.3750000       NA  0  0  0       NA -10.28161     N
4   2  43        1       NA 0.8333333       NA  1  1  1       NA  12.76895     L
5   2 169        1       NA 0.0000000       NA  0  0  0       NA -10.28161     N
6   4  19        1       NA 0.1428571       NA  0  0  0       NA -10.28161     N
>
Weights-based method
The epiWeights function implements the weights-based method. R documentation has a
nice introduction to the weights-based method:
> help("epiWeights")
For more details about the weights method, refer to P. Contiero et al. The
Epilink record linkage software.
Methods Inf Med., 44(1):66–71, 2005.
The mechanism of invoking and finally generating the results is very similar to how we did
it using emWeights:
library(RecordLinkage)
data("RLdata500")
# weight calculation
rec.pairs <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,strcmp = c(2,3,4)
,strcmpfun = levenshteinSim)
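The weight calculation itself is a single call to epiWeights; a minimal sketch, assuming the same rec.pairs object as before:

pairs.weights <- epiWeights(rec.pairs)
hist(pairs.weights$Wdata)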
Once again, the distribution is similar to our histogram from the emWeights method. Let's
see an alternate view of the weights distribution:
> summary(pairs.weights)
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Weight distribution:
# Classification
pairs.classify <- emClassify(pairs.weights, threshold.upper = 0.5,
threshold.lower = 0.3)
> head(final)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match links
1 106 175        1       NA 1.0000000       NA  1  0  1       NA     L
2 108 203        1       NA 1.0000000       NA  0  1  1       NA     L
3 112 116        1       NA 0.8000000       NA  1  1  1       NA     L
4 120 165        1       NA 0.8750000       NA  1  1  1       NA     L
5 125 193        1       NA 0.8750000       NA  1  1  1       NA     L
6 127 142        1       NA 0.8333333       NA  1  1  0       NA     L
  fname_c1.1 fname_c2.1 lname_c1.1 lname_c2.1 by.1 bm.1 bd.1    Weight
1      ANDRE       <NA>    MUELLER       <NA> 1976    2   25 0.6910486
2    GERHARD       <NA>  FRIEDRICH       <NA> 1987    2   10 0.6133400
3    GERHARD       <NA>      ERNSR       <NA> 1980   12   16 0.7518301
4      FRANK       <NA>   BERGMANN       <NA> 1998   11    8 0.7656562
5  CHRISTIAN       <NA>   MUELLEPR       <NA> 1974    8    9 0.7656562
6       KARL       <NA>      KLEIN       <NA> 2002    6   20 0.6228760
>
Unsupervised learning
Let's start with an unsupervised machine learning technique, K-means clustering. K-means
is a well-known and popular clustering algorithm and works based on the principles of
expectation maximization. It belongs to the class of iterative descent clustering methods.
Internally, it assumes the variables are of quantitative type and uses Euclidean distance as
the dissimilarity measure to arrive at the clusters.
The K is a parameter to the algorithm. K stands for the number of clusters we need. Users
need to provide this parameter.
Unsupervised techniques are appealing as they don't require us to build a training set. To
use a supervised technique, we need to create a training set: a tuple consisting of our
features and a label saying whether that record pair is a match or a no-match. So far, we do
not have such a training dataset. Hence, let's start with an unsupervised technique:
> library(RecordLinkage, quietly = TRUE)
> data("RLdata500")
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
> summary(kmeans.model)
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Classification table:
classification
true status N P L
<NA> 1071 0 150
We generate our features using the compare.dedup method. This time we are sticking to
string-based features. The classifyUnsup method invokes K-means from the e1071
package, with a K value of 2 supplied to the algorithm. Looking at the output, K-means has
created two groups: the no-link cluster, represented by N, has around 1,071 record pairs
assigned to it, and around 150 pairs are assigned to the L cluster:
> final.results <- kmeans.model$pairs
> final.results$prediction <- kmeans.model$prediction
> head(final.results)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match prediction
1 1 174 1 NA 0.1428571 NA 0 0 0 NA N
2 1 204 1 NA 0.0000000 NA 0 0 0 NA N
3 2 7 1 NA 0.3750000 NA 0 0 0 NA N
4 2 43 1 NA 0.8333333 NA 1 1 1 NA L
5 2 169 1 NA 0.0000000 NA 0 0 0 NA N
6 4 19 1 NA 0.1428571 NA 0 0 0 NA N
>
We extract the input pairs and the prediction, where L stands for link available, aka match,
and N stands for no link available, aka no match.
In cases where we have the ground truth, that is, features for the tuples and also a label
saying whether the tuples represent the same entity, we can leverage classification models
in a supervised setting. Let us look at supervised learning next.
Supervised learning
In a supervised learning scenario, we need to provide the algorithm with a set of training
tuples. Each tuple has our features from record pairs and a label classifying the tuple as
either a match or no match. In our case, we don't have any labeled data.
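Fortunately, RLdata500 ships with identity.RLdata500, a vector that gives the true entity behind each record. Passing it through the identity parameter of compare.dedup labels each generated pair; the call below mirrors the complete code listing at the end of this chapter:

> str(identity.RLdata500)
> rec.pairs <- compare.dedup(RLdata500
+              ,identity = identity.RLdata500
+              ,blockfld = list(1, 5:7)
+              )
> head(rec.pairs$pairs)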
If you look at the output of rec.pairs$pairs, you will notice that the is_match column now
says whether the record pair is a match or no match. Previously, when we did not provide
the identity parameter, it was initialized to NA. We are going to leverage this output to train
our classification model:
> train <- getMinimalTrain(rec.pairs)
> model <- trainSupv(train, method ="bagging")
> train.pred <- classifySupv(model, newdata = train)
> test.pred <- classifySupv(model, newdata = rec.pairs)
>
> summary(train.pred)
500 records
17 record pairs
9 matches
8 non-matches
0 pairs with unknown status
9 links detected
0 possible links detected
8 non-links detected
Classification table:
classification
true status N P L
FALSE 8 0 0
TRUE 0 0 9
> summary(test.pred)
500 records
1221 record pairs
49 matches
1172 non-matches
0 pairs with unknown status
52 links detected
0 possible links detected
1169 non-links detected
Classification table:
classification
true status N P L
FALSE 1168 0 4
TRUE 1 0 48
>
Using the getMinimalTrain function, we get a small set of records from our rec.pairs
as our training data. We initialize and train a bagging model using this data with the
trainSupv function. Finally, using the classifySupv function, we run our model on our
training data and test data, which is the whole rec.pairs in this case to get the
predictions. Finally, using the summary function, we can look at the accuracy of our model.
We have a 100% accurate model in our training set. RecordLinkage supports a lot of
classification models, including neural networks, svm, bagging, and trees. As we have
a very small training set, it is advisable to use bagging or svm. Finally, we have around
99% accuracy on our whole dataset.
Alternatively, we can leverage our unsupervised clustering output and use it as an initial
training set to build our first supervised learning model:
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
>
> # Run K-Means Model
> kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
>
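Before we can use the predictions as labels, we extract the pairs and their predictions from the model object; a minimal sketch, following the same pattern we used for final.results earlier:

> pairs <- kmeans.model$pairs
> pairs$prediction <- kmeans.model$prediction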
We pass our rec.pairs to the clustering method and extract the pairs with their
predictions. We want to replace the is_match column with our predictions. However, the
values should be 0 or 1 in the is_match column instead of N or L:
> pairs$is_match <- NULL
> pairs$is_match <- ifelse(pairs$prediction == 'N', 0,1)
> pairs$prediction <- NULL
> pairs[is.na(pairs)] <- 0
> head(pairs)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1 1 174 1 0 0.1428571 0 0 0 0 0
2 1 204 1 0 0.0000000 0 0 0 0 0
3 2 7 1 0 0.3750000 0 0 0 0 0
4 2 43 1 0 0.8333333 0 1 1 1 1
5 2 169 1 0 0.0000000 0 0 0 0 0
6 4 19 1 0 0.1428571 0 0 0 0 0
>
> rec.pairs$pairs <- pairs
> head(rec.pairs$pairs)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1 1 174 1 0 0.1428571 0 0 0 0 0
2 1 204 1 0 0.0000000 0 0 0 0 0
3 2 7 1 0 0.3750000 0 0 0 0 0
4 2 43 1 0 0.8333333 0 1 1 1 1
5 2 169 1 0 0.0000000 0 0 0 0 0
6 4 19 1 0 0.1428571 0 0 0 0 0
>
>
Now, having changed our is_match column to binary, we replace the original pairs data
frame in rec.pairs with our modified data frame, pairs. With this achieved, we can follow
the process of generating a small training set, building a model, and verifying the accuracy
of the model, as we did in the previous section:
train <- getMinimalTrain(rec.pairs)
Warning message:
In getMinimalTrain(rec.pairs) :
Comparison patterns in rpairs contain string comparison values!
> model <- trainSupv(train, method ="bagging")
> train.pred <- classifySupv(model, newdata = train)
> test.pred <- classifySupv(model, newdata = rec.pairs)
>
> summary(train.pred)
500 records
82 record pairs
38 matches
44 non-matches
0 pairs with unknown status
38 links detected
0 possible links detected
44 non-links detected
Classification table:
classification
true status N P L
FALSE 44 0 0
TRUE 0 0 38
> summary(test.pred)
500 records
1221 record pairs
150 matches
1071 non-matches
0 pairs with unknown status
Classification table:
classification
true status N P L
FALSE 1071 0 0
TRUE 0 0 150
We have achieved 100% accuracy on both our training set and the whole dataset.
Finally, let us wrap these methods into a small RShiny application. The application should:
1. Load the RLdata500 data from the RecordLinkage package and display it to the user
2. Implement the weights algorithm and display the weight range as a histogram
3. Allow the user to select the lower and upper thresholds of weights for classification
4. Based on the user-selected thresholds, do a record matching and display the duplicate
entities discovered
There are two panels: one to show the RLdata500 data and another to show the results of
the weights algorithm.
The server code runs the emWeights algorithm and displays the histogram. The user
interface provides sliders for threshold selection. The user can choose the necessary
thresholds based on the histogram displayed. Using these thresholds, a classification is made
and the results are shown to the user.
The results of the emWeights algorithm are shown as a histogram. The histogram shows the
range of weights produced by the emWeights algorithm. In order to find duplicates, users can
select lower and upper thresholds for the weights and pass them to the emClassify method.
Complete R code
The complete R code for this chapter is given as follows. We have split the different
algorithms used for record linkage into different sections, for ease of reading.
Feature generation
Let us begin with the feature generation R code:
library(RecordLinkage, quietly = TRUE)
data("RLdata500")
rec.pairs <- compare.dedup(RLdata500
            ,blockfld = list(1, 5:7)
            ,strcmp = c(2,3,4)
            ,strcmpfun = levenshteinSim)
summary(rec.pairs)
RLdata500[1,]
RLdata500[174,]
# String features
rec.pairs.matches <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,strcmp = c(2,3,4)
,strcmpfun = levenshteinSim)
head(rec.pairs.matches$pairs)
# Phonetic features
rec.pairs.matches <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,phonetic = c(2,3,4)
,phonfun = pho_h)
head(rec.pairs.matches$pairs)
RLdata500[2,]
RLdata500[43,]
After feature generation, let us look at the code for the expectation maximization method.
# Em weight calculation
rec.pairs <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,strcmp = c(2,3,4)
,strcmpfun = levenshteinSim)
pairs.weights <- emWeights(rec.pairs)
hist(pairs.weights$Wdata)
summary(pairs.weights)
weights.df<-getPairs(pairs.weights)
head(weights.df)
# Classification
pairs.classify <- emClassify(pairs.weights, threshold.upper = 10,
threshold.lower = 5)
# Plot of the link types (reconstructed; the start of this call was lost in the listing)
barplot(table(pairs.classify$prediction),
        xlab="Link Types")
This is the complete source code for the expectation maximization method. Let us now look
at the weights method.
Weights-based method
The weights-based algorithm for record linkage:
library(RecordLinkage)
data("RLdata500")
# weight calculation
rec.pairs <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,strcmp = c(2,3,4)
,strcmpfun = levenshteinSim)
pairs.weights <- epiWeights(rec.pairs)
hist(pairs.weights$Wdata)
summary(pairs.weights)
weights.df<-getPairs(pairs.weights)
head(weights.df)
# Classification
pairs.classify <- emClassify(pairs.weights, threshold.upper = 0.5,
threshold.lower = 0.3)
# weight calculation
rec.pairs <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,strcmp = c(2,3,4)
,strcmpfun = levenshteinSim)
# Unsupervised classification
kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
summary(kmeans.model)
# Supervised Learning 1
str(identity.RLdata500)
rec.pairs <- compare.dedup(RLdata500
,identity = identity.RLdata500
,blockfld = list(1, 5:7)
)
head(rec.pairs$pairs)
train <- getMinimalTrain(rec.pairs)
model <- trainSupv(train, method ="bagging")
train.pred <- classifySupv(model, newdata = train)
test.pred <- classifySupv(model, newdata = rec.pairs)
summary(train.pred)
summary(test.pred)
# Supervised learning 2
rec.pairs <- compare.dedup(RLdata500
,blockfld = list(1, 5:7)
,strcmp = c(2,3,4)
,strcmpfun = levenshteinSim)
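# Intermediate steps, repeated from the earlier section: use the K-means
# predictions as is_match labels, rebuild rec.pairs$pairs, and train a
# bagging model on a minimal training set.
kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
pairs <- kmeans.model$pairs
pairs$prediction <- kmeans.model$prediction
pairs$is_match <- NULL
pairs$is_match <- ifelse(pairs$prediction == 'N', 0, 1)
pairs$prediction <- NULL
pairs[is.na(pairs)] <- 0
rec.pairs$pairs <- pairs
train <- getMinimalTrain(rec.pairs)
model <- trainSupv(train, method ="bagging")
train.pred <- classifySupv(model, newdata = train)
test.pred <- classifySupv(model, newdata = rec.pairs)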
summary(train.pred)
summary(test.pred)
Having seen the source code for all the different methods, let us now proceed to look at the
complete source code for our RShiny application.
RShiny application
The RShiny application source code:
library(shiny)
library(RecordLinkage)
data("RLdata500")
ui <- fluidPage(
navbarPage("Record Linkage",
tabPanel("Load"
, dataTableOutput("records")
),
tabPanel("Weights Method"
,plotOutput("weightplot")
,sliderInput("lowerthreshold", "Weight Lower threshold:",
             min = 0.0, max = 1.0, value = 0.2)
,sliderInput("upperthreshold", "Weight Upper threshold:",
             min = 0.0, max = 1.0, value = 0.5)
,dataTableOutput("weights")
)
)
)
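The server function is not shown in the listing above. What follows is a minimal sketch of what it could look like, not the book's original code: it assumes render functions named records, weightplot, and weights (matching the output IDs in the UI), and it uses epiWeights because its 0-1 weight range matches the slider defaults; swap in emWeights and wider slider ranges if the EM-based weights are preferred:

server <- function(input, output) {
  # Feature generation is done once when the app starts
  rec.pairs <- compare.dedup(RLdata500
                             ,blockfld = list(1, 5:7)
                             ,strcmp = c(2,3,4)
                             ,strcmpfun = levenshteinSim)
  pairs.weights <- epiWeights(rec.pairs)
  # Load tab: show the raw RLdata500 records
  output$records <- renderDataTable(RLdata500)
  # Weights tab: histogram of the weights
  output$weightplot <- renderPlot(hist(pairs.weights$Wdata))
  # Classify using the user-selected thresholds and show the detected links
  output$weights <- renderDataTable({
    pairs.classify <- emClassify(pairs.weights,
                                 threshold.upper = input$upperthreshold,
                                 threshold.lower = input$lowerthreshold)
    getPairs(pairs.classify, show = "links", single.rows = TRUE)
  })
}
shinyApp(ui = ui, server = server)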
Summary
We introduced the problem of record linkage and emphasized its importance. We
introduced the package, RecordLinkage, in R to solve record linkage problems. We started
with generating features, string- and phonetic-based, for record pairs so that they can be
processed further down the pipeline to dedup records. We covered expectation
maximization and weights-based methods to perform a dedup task on our record pairs.
Finally, we wrapped up the chapter by introducing machine learning methods for dedup
tasks. Under unsupervised methods, K-means clustering was discussed. We further
leveraged the output of the K-means algorithm to train a supervised model.
In the next chapter, we will go through streaming data and its challenges. We will build a
stream clustering algorithm for a given data stream.
Chapter 7, Streaming Data Clustering Analysis in R
In all those instances where data is collected from various sources, brought to a centralized
location, and stored for analysis, that data is called data at rest. There is a huge time delay
between the time the data was recorded and when the analysis is performed. Analyzing the
last 6 months' inventory data is an example of data at rest. Today, the majority of data
analysis is performed on data at rest.
With the number of Internet of Things projects on the rise, there is a great demand today to
perform analysis on data in motion, also called streaming data. Streaming data is becoming
ubiquitous with the number of addressable sensors and devices being added to the internet.
As an example from computer network monitoring, an intrusion detection system analyzes
the network packets received in real time to quickly determine whether the system is
compromised, and takes appropriate action. Latency is key when analyzing data in motion.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the stream R package that we will be
using. During our code description, we will be using some of the output printed in the
console. We have included what will be printed in the console immediately following the
statement that prints the information, so as not to disturb the flow of the code.
The processing challenges in stream data are shown in the following figure:
Bounded problems
The first challenge is deciding on a window, and what size window we need to
accommodate to make sense of the incoming data. By window, we mean storing the last n
data points. For streaming data, it is rare to actually process records one at a time. One of
the most common ways to process streaming data is to process it in a window. The window
can either bundle data points into a group and process them as a unit (a tumbling window),
or it can be a changing group of n data points (or x units of time) where older records are
discarded as newer records are added (a sliding window).
Drift
The second challenge is the non-stationary aspect of the streaming data. Data is considered
stationary if its statistical attributes such as mean, standard deviation, and others do not
vary over time. However, we cannot make this assumption for streaming data. This non-
stationary behavior is also called drift. Our algorithms should be able to spot and handle
drifts efficiently.
The paper, Open Challenges for Data Stream Mining Research, by Georg Krempl et al.
(http://www.kdd.org/exploration_files/16-1-2014.pdf), clearly classifies the various issues
with processing stream data.
Single pass
Further, the incoming stream has to be processed in a single pass. Any delay may lead to
the loss of data or loss of a window of opportunity.
Real time
Finally, the algorithm has to be fast enough; it has to process the data at the same speed at
which the data arrives.
Now that we have elicited the processing issues associated with stream processing, let us
move on to the next section.
The online/offline two-stage processing is the most common framework adopted by many
of the stream clustering algorithms.
Micro-clusters are created in a single pass over the data. As each data point arrives, it is
assigned to the closest micro-cluster. You can think of a micro-cluster as a summary of
similar data points. The summary is typically stored in the form of a cluster center and the
local density of the points, and may include more statistics, such as the variance. If we are
not able to allocate a new incoming data point to any of the existing micro-clusters, a new
micro-cluster is formed with that data point.
The online step deals with micro-cluster formation. As the data arrives, either new
micro-clusters are created or points are assigned to existing micro-clusters.
Macro-cluster
When a more robust clustering is either demanded by the user or by the application, the
offline stage kicks off. Here the micro-clusters generated in the online steps are used as
initial centers. A more rigorous clustering is performed in the background to generate the
final list of clusters, called a macro-cluster. Since this is not time-critical, it can be executed
the same way we work on data at rest.
For a more rigorous treatment of the various stream data clustering algorithms, please refer
to State-of-the-art on clustering data streams, by Mohammed Ghesmoune, Mustapha Lebbah,
and Hanene Azzag: https://bdataanalytics.biomedcentral.com/track/pdf/10.1186/s41044-016-0011-3?site=bdataanalytics.biomedcentral.com
Calling the get_points function with this stream and specifying the number of tuples to be
generated using the parameter n, we get our dataframe of 100 points in three-dimensional
space. With the class parameter, we also request the cluster group that each point belongs to.
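A sketch of such a call (the generator parameters here are assumptions; any three-dimensional DSD will do):

library(stream)
stream <- DSD_Gaussians(k = 3, d = 3)
data <- get_points(stream, n = 100, class = TRUE)
head(data)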
If we want to have a simulation function mimicking a live IoT system, where data is
generated every n seconds/minutes, we can place the DSD function inside a while loop with
a timer. Calling this while loop would continuously generate data every n seconds/minutes
as per the timer configuration.
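A minimal sketch of such a simulation loop, with an assumed batch size and delay:

while (TRUE) {
  new.points <- get_points(stream, n = 10)   # pull a small batch from the DSD
  # hand new.points over to the downstream consumer here
  Sys.sleep(5)                               # wait 5 seconds before the next batch
}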
There are a lot of data generators available within the stream package. Refer to the stream
documentation to learn more about other data generators:
https://cran.r-project.org/web/packages/stream/vignettes/stream.pdf
DSD_MG stands for moving generator. It's a drift generator consisting of several MGCs
(moving generator clusters). In this example, we have used a linear MGC. The add_keyframe
function allows us to specify the drift happening over time. Inside the function, we define the
center of the cluster for a particular point in time, denoted by the parameter time. We have
defined three keyframes where the center is changed.
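The set-up is not reproduced in full in the text; a sketch along the lines of the stream documentation, with assumed densities, parameters, and centers:

drift.stream <- DSD_MG(dim = 2)
cluster.1 <- MGC_Linear(dim = 2)
add_keyframe(cluster.1, time = 1,  dens = 50, par = 5, center = c(0, 0))
add_keyframe(cluster.1, time = 20, dens = 50, par = 5, center = c(50, 50))
add_keyframe(cluster.1, time = 40, dens = 50, par = 5, center = c(50, 100))
add_cluster(drift.stream, cluster.1)
drift.stream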
Look at the time, it says 1. Let us get the data at this time and plot the same.
Now we see that the time is around 20. Let us get another 1,000 points and plot them.
The plot of points gathered around time 20 is shown in the following figure:
Clearly, the center has changed compared to the last time frame.
We see that time is now around 40. Let us take another 1000 data points.
The three graphs convey the drift clearly. We can see the centers of the clusters shifting over
time.
There are several other useful functions, such as writing and reading streams to disk,
playing back the stream, and many others. We strongly urge you to go through the package
documentation.
A curious reader may be wondering why, in both the examples we showed, we sampled the
data at a particular timestamp and never used the stream of data in real time. Please bear
with us until we discuss our use case. We will show how the stream can be fed into a
clustering algorithm, so the algorithm can work on the incoming stream of data and not on
data sampled at different timestamps. Though the stream package does not provide a direct
mechanism to do this, we achieve it with an external timer and by matching the number of
records sampled to the window size we need to process.
We used DSD_Gaussians to create some random four-dimensional records, and have stored
them in a dataframe called data.un.
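The memory interface itself can be created with DSD_Memory; a minimal sketch (the exact call is not shown in the text, so treat it as an assumption):

replayer <- DSD_Memory(data.un, k = NA)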
We have created a memory stream interface to our dataframe: data.un. Now, we can play
this dataframe as we like:
> get_points(replayer, n=5)
X1 X2 X3 X4
1 199.9456 77.09095 19.98670 750.1129
2 195.9696 80.10380 16.01115 789.9727
3 196.0109 79.95394 16.08678 790.1042
4 199.9882 77.03385 20.00825 750.0069
5 200.0485 76.84687 19.94311 750.0130
> replayer
Memory Stream Interface
Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
With NA clusters in 4 dimensions
Using get_points, we have extracted five points. We can now see that our interface is in
position 6.
DSD_ReadCSV can help to read a very large CSV file line by line, if it cannot be loaded into
memory completely.
Inflight operation
DSD_ScaleStream can be used to standardize (centering and scaling) data in a data stream
in-flight.
We saw various types of DSD in this section. DSD acting as a simulator is primarily used for
experimentation. It comes in handy when we need to try new data analysis algorithms. The
DSD to read a CSV file or read a database can come in very handy in real-world
applications.
1. We can write our own DSD derived from the abstract DSD class.
2. An actual stream can be made to be written to a CSV file or a database. In this
case, we can leverage DSD_Memory to read the stream.
Data stream clustering is implemented as a part of the data stream task (DST) in the stream
package. It is called a DSC class. DSC_R is a native R implementation of the clustering
algorithms. DSC_MOA provides an interface to algorithms implemented in the MOA Java
framework. In this chapter, we will be only looking at DSC_R, the native R algorithms.
The DSC classes provide two functions to perform the online and the offline steps.
The update() function accepts a DSD and a DSC object, and creates the micro-clusters. At
any point, we can query this class for the number of micro-clusters formed.
The recluster function performs the offline step. The macro-clusters are formed from the
micro-clusters inside this function.
Let us look at how to code up a data stream clustering method using the stream package.
Let us do the online part to begin with, where we want to form micro-clusters:
> stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
> stream
Mixture of Gaussians
Class: DSD_Gaussians, DSD_R, DSD_data.frame, DSD
With 3 clusters in 2 dimensions
We have used a Gaussian data generator, generating a two-dimensional data formed into
three clusters.
Now that we have created our data generator, let us move on to our clustering algorithm:
> clus.alg <- DSC_DBSTREAM(r = 0.03)
> clus.alg
DBSTREAM
Class: DSC_DBSTREAM, DSC_Micro, DSC_R, DSC
Number of micro-clusters: 0
Number of macro-clusters: 0
>
Internally, the DBSTREAM algorithm stores, for each micro-cluster, a data point that defines
its center. This data point is called the leader. It also stores the density of the cluster in an
area defined by a radius threshold. This threshold is provided by the user: while initializing
DBSTREAM, we supplied the radius through the r parameter.
If the data point falls into the assignment area of several micro-clusters, then all of them are
updated.
Refer to the paper, Clustering Data Streams Based on Shared Density Between Micro-Clusters, by
Michael Hahsler and Matthew Bolaños, for more information about DBSTREAM.
With this understanding, let's call the update function to create micro-clusters:
> update(clus.alg, stream, n =100)
> clus.alg
DBSTREAM
Class: DSC_DBSTREAM, DSC_Micro, DSC_R, DSC
Number of micro-clusters: 5
Number of macro-clusters: 2
As you can see, we call the update function with both our DSD object, stream, and our DSC
object, clus.alg. During each call, we pull around 100 records (n = 100) from the stream.
Calling update again pulls another batch and refines the micro-clusters:
> update(clus.alg, stream, n =100)
> clus.alg
DBSTREAM
Class: DSC_DBSTREAM, DSC_Micro, DSC_R, DSC
Number of micro-clusters: 26
Number of macro-clusters: 9
>
The update function creates the micro-clusters. We can see the number of micro-
clusters from the shell output.
Finally, we plot our data points and the micro-clusters using plot(clus.alg,
stream)
We can see the data points and the micro-clusters in the following graph:
The DSC_DBSTREAM has inbuilt reclustering, hence you can see the macro-
clusters in the output. There is no need to run a separate offline process.
Now that we have our clusters, both micro and macro, we need a good way to evaluate the
quality of these clusters. A good clustering algorithm should produce the output in such a
way that the inter-cluster distance is large and the intra-cluster distance is small. The
stream package provides a function called evaluate to find the quality of the clustering
algorithm's output:
> evaluate(clus.alg, stream)
Evaluation results for micro-clusters.
Points were assigned to micro-clusters.
It gives a lot of different evaluation metrics. We can choose one among them based on our
criteria.
There are many more clustering algorithms implemented in the stream package. Refer to the
documentation for other algorithms.
Hopefully, by now we are equipped with enough information to solve our use case using the
stream package.
This example is inspired by the sensor network example given in PubNub at:
https://www.pubnub.com/developers/realtime-data-streams/sensor-network/.
Our digital control system is generating data at a particular rate. It has the following sensors
installed:
The data from these sensors is pumped to a message queue. We have two systems
consuming the data from our message queue.
Speed layer
This is where the real-time analysis of the incoming data is performed. Our R code will be
running here. If this layer were realized by, say, Apache Storm (http://storm.apache.org/),
the bolts would be running our cluster creation code.
The speed layer talks to the real-time dashboard and the mobility layer. The micro-clusters
discovered are pushed to the dashboard and to the mobility layer.
Batch layer
We are not concerned about the batch layer for this exercise. The batch layer is responsible
for keeping the master data intact. Since the batch layer operates at a much higher latency,
we have the speed layer to process the continuous stream of data.
Now that we have a good idea of the overall architecture of our system, let us proceed to
develop the stream clustering algorithm. For the cluster algorithm development purpose,
we are going to leverage the DSD simulators in the stream package to act as our sensors in
the digital control system.
As you can see, we want DSD_Gaussians to generate a four-dimensional dataset, one
dimension for each sensor. We expect two clusters to be formed out of this dataset. The
centers for the clusters are provided as a matrix to the mu parameter.
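A sketch of the generator set-up; the centers below are assumptions read off the sample output shown earlier, not the book's exact values:

mu <- rbind(c(200, 77, 20, 750),
            c(196, 80, 16, 790))
sensor.stream <- DSD_Gaussians(k = 2, d = 4, mu = mu)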
It's a good practice to scale the incoming data before we proceed to do any analysis. Luckily
we have DSD_ScaleStream to apply scaling on the incoming stream data.
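A minimal sketch of the wrapping call (assumed; the original listing is not reproduced in the text):

stream.norm <- DSD_ScaleStream(sensor.stream)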
The DSD_ScaleStream class is used to scale the incoming data from our Gaussian stream.
Let us plot this data:
We have normalized our data and are ready to apply the clustering.
Inside the stream package, we have a class, DSC_TwoStage, which will help us pipeline
our online and offline algorithms. Let us see how we can use this class:
> micro.alg <- DSC_Sample(k = 10)
> macro.alg <- DSC_Kmeans(k = 2)
We begin with defining the online clustering algorithm for creating micro-clusters. We
use the DSC_Sample class. It's one of the simplest online clustering algorithms. It extracts
samples from a data stream using reservoir sampling. The extracted sample is stored as a
set of micro-clusters. Before we proceed further, let us quickly look at the reservoir
sampling.
Reservoir sampling
Given n items, where n is very large or not known in advance, the reservoir sampling
algorithm helps us select k samples from n items. The algorithm to begin with initializes an
array of size k. It then copies the first k items from the stream into the array. Now, it
proceeds to evaluate each item in the array. A random number, j, between 0 and i, is
generated, where i is the index of the item we are currently evaluating. If, j is in range of 0
to k-1, we replace the j element in the array with i th element in the stream.
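As an illustration only (not part of the stream package), the 1-indexed variant of the algorithm can be written in a few lines of R:

reservoir_sample <- function(x, k) {
  reservoir <- x[1:k]                    # fill the reservoir with the first k items
  for (i in (k + 1):length(x)) {
    j <- sample(0:(i - 1), 1)            # random index between 0 and i-1
    if (j < k) reservoir[j + 1] <- x[i]  # replace with probability k/i
  }
  reservoir
}
# usage: pick 10 values from a simulated stream of 1,000 points
reservoir_sample(rnorm(1000), 10)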
The parameter k, to DSC_Sample, is used to set the k value for the reservoir sampling, and
eventually sets the number of micro-clusters we expect from the algorithm. In our case, we
have set k to 10. Further down the code, we see that 10 micro-clusters are generated.
Next, we define the clustering algorithm for the offline step. In this case, we use a simple k-
means algorithm:
We will not delve into the details of the k-means algorithm here. More information about R's
kmeans can be found at
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
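A sketch of how the two-stage pipeline might be assembled and evaluated (the window size n and the exact evaluate call are assumptions):

pipe <- DSC_TwoStage(micro = micro.alg, macro = macro.alg)
update(pipe, stream.norm, n = 1000)
evaluate(pipe, stream.norm, measure = c("numMicroClusters", "purity"))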
Using DSC_TwoStage, we pipe our online and offline algorithms together. By calling the
update function, we invoke both the online and the offline algorithms. We run our
algorithm on our normalized data stream, stream.norm, and evaluate the result:
numMicroClusters purity
              10      1
Purity is one of the simplest measures to assess the quality of clusters. Purity is derived by
summing up the number of majority-class elements in each cluster and dividing it by the
total number of data points. Look at the following example:
In cluster 1, the triangle is the majority class and the count is 4. In cluster 2, the circle is the
majority class and the count is again 4. The purity is then (4 + 4)/10, so 0.8 is the purity of
this clustering.
In our case, we have achieved a purity of 1. Our clusters are therefore a very good fit to the
incoming streaming data.
Now, we need to move this setup to our streaming infrastructure. It is beyond the scope of
this book to show how to set up the streaming infrastructure shown above. However, we
will give some pointers here to deploy our stream clustering algorithm:
1. We need to run R and install the necessary packages, including the stream package,
in our speed layer.
2. Assuming that the speed layer is a setup similar to Apache Storm spouts and bolts,
we can have the following arrangement:
   1. A Storm bolt combination reads the data from the message queue and
      populates an in-memory database. This can be a small in-memory SQL
      database or another key-value store.
   2. Another Storm bolt combination acts as follows:
      1. Use DSD_ReadDB to connect to the database using R DBI and
         extract the data.
      2. The micro/macro cluster twin, set up using DSC_TwoStage, can
         now query the DSD_ReadDB at fixed intervals as defined by the
         overall application. When using the update function, the window
         size can be passed as the parameter n. The results of the clustering
         can now be pushed to the dashboard (or a pull mechanism from the
         dashboard can be set up). The functions get_points and
         get_centers can provide the cluster details.
Hopefully, this gave an overview of how the stream package can be used in a real-time data
processing framework.
Complete R code
install.packages("stream")
library(stream)
library(plotly)
# data is the 100-point, three-dimensional sample generated with get_points
p <- plot_ly(data, x = ~X1, y = ~X2, z = ~X3, color = ~class) %>%
  add_markers()
p
# Drift generators
drift.stream <- DSD_MG(dim = 2)
cluster.1 <- MGC_Linear(dim = 2)
# Keyframes (reconstructed; the exact densities, parameters, and centers are
# assumptions) define how the cluster center drifts over time
add_keyframe(cluster.1, time = 1,  dens = 50, par = 5, center = c(0, 0))
add_keyframe(cluster.1, time = 20, dens = 50, par = 5, center = c(50, 50))
add_keyframe(cluster.1, time = 40, dens = 50, par = 5, center = c(50, 100))
add_cluster(drift.stream, cluster.1)
drift.stream
data <- get_points(drift.stream, n = 1000)
plot_ly(data, x = ~X1, y = ~X2) %>% add_markers() %>% layout(title = 'time around 1')
drift.stream
data <- get_points(drift.stream, n = 1000)
plot_ly(data, x = ~X1, y = ~X2) %>% add_markers() %>% layout(title = 'time around 20')
drift.stream
data <- get_points(drift.stream, n = 1000)
plot_ly(data, x = ~X1, y = ~X2) %>% add_markers() %>% layout(title = 'time around 40')
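# DBSTREAM clustering (repeated from the earlier section)
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
clus.alg <- DSC_DBSTREAM(r = 0.03)
update(clus.alg, stream, n = 100)
plot(clus.alg, stream)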
evaluate(clus.alg, stream)
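# Sensor simulation and in-flight scaling (reconstructed; the cluster centers
# are assumptions inferred from the sample output earlier in the chapter)
mu <- rbind(c(200, 77, 20, 750),
            c(196, 80, 16, 790))
sensor.stream <- DSD_Gaussians(k = 2, d = 4, mu = mu)
stream.norm <- DSD_ScaleStream(sensor.stream)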
par(mfrow = c(2,2))
data.un <- get_points(sensor.stream, n =100)
head(data.un)
par(mfrow = c(1,1))
# Perform Clustering
micro.alg <- DSC_Sample(k = 10)
macro.alg <- DSC_Kmeans(k = 2)
pipe <- DSC_TwoStage(micro = micro.alg, macro = macro.alg)
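update(pipe, stream.norm, n = 1000)   # the window size n is an assumption
evaluate(pipe, stream.norm, measure = c("numMicroClusters", "purity"))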
Summary
The chapter started with an overview of data at rest and data in motion, the latter also called
streaming data. We further dwelled on the properties of streaming data and the challenges it
poses while processing it. We introduced stream clustering algorithms, and the well-known
online/offline approach to stream clustering was discussed. Later on, we introduced various
classes in the stream package and how to use them. During that process, we discussed
several data generators, the DBSTREAM algorithm to find micro- and macro-clusters, and
several metrics to assess the quality of clusters. We then introduced our use case. We went
ahead and designed a clustering algorithm, with the online part based on reservoir sampling
and the offline part handled by the k-means algorithm. Finally, we described the steps
needed to take this whole setup to a real streaming environment.
In the next chapter, we will explore graph mining algorithms. We will show you how to use
the package igraph to create and manipulate graphs. We will discuss Product Network
Analysis and show how graph algorithms can assist in generating micro categories.
Chapter 8, Analyze and Understand Networks Using R
Network analysis is the study of graphs. Graphs are defined by a set of nodes, or vertices,
connected by edges. Both the nodes and the edges can have attributes describing them. Most
importantly, the edges can carry weights, indicating the importance of the connection. When
the directions of the edges are preserved, the graph is called a directed graph; when not
preserved, it's called an undirected graph. Network analysis, or network theory, or graph
theory provides a rich set of algorithms to analyze and understand graphs. The famous
Koenigsberg problem (http://mathworld.wolfram.com/KoenigsbergBridgeProblem.html)
introduced by Euler is one of the first graph theory problems to be studied. Koenigsberg is
an old city in Prussia (modern-day Russia). The river Pregel separates the city, and there are
two islands in it. There are seven bridges connecting the islands and the city. The
Koenigsberg problem was to devise a walk through the city that would cross each of those
bridges once and only once.
One graph structure that we all know today and is easy to relate to is the social network
structure formed by various social media applications such as Facebook and LinkedIn. In
these networks, people form the vertices and, when two of them are connected to each
other, an edge is drawn between those vertices. The whole internet is a graph of connected
machines. Other examples from biology include protein-protein interaction networks,
genetic interaction networks, and so on.
When we represent a problem as a graph, it may give us a different point of view to solve
that problem. Sometimes it can make the problem simpler to solve. One such problem that
we are going to see in this chapter is assigning categories to items. More importantly, we
will understand the micro-categorization of items in a retail setting. Though our example is
from a retail setting, this technique is not limited to the retail domain. We will show how
we can leverage graphs to assign categories to items. This technique is called the Product
Network Analysis.
This chapter is loosely based on the following two papers: Product Network
Analysis – The Next Big Thing in Retail Data Mining--a white paper by Forte
Consultancy and Extending Market Basket Analysis with Graph Mining
Techniques: A Real Case, by Ivan F. Videla-Cavieres , Sebastián A. Ríos,
University of Chile, Department of Industrial Engineering, Business
Intelligence Research Center (CEINE), Santiago, Chile.
Category management is very important for retailers. Having products grouped into the
right category is the first step for retailers to manage their products. Downstream
applications such as up-selling, cross-selling, and loyalty systems can benefit tremendously
with the right category assignment. In the absence of a sophisticated product network
analysis system, product categorization is done manually and is heavily dependent on the
product features entered either by the suppliers or the procurement teams. This
categorization may not be accurate and will be heavily biased by human judgment. It's
impossible to expect concordance between two people in a task such as this one.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1.
As we work through our example, we will introduce the R packages, igraph, and arules
that we will be using. During our code description, we will be using some of the output
printed in the console. We have included what will be printed in the console immediately
following the statement that prints the information to the console, so as not to disturb the
flow of the code.
Graphs in R
We will use the R package, igraph, for our graph analysis needs. We will leverage the
arules package to manipulate our data. If you don't have them installed, proceed to install
them as follows:
> install.packages("arules")
> install.packages("igraph")
You can use the sessionInfo function from the utils package to look at the packages
available for you in the current session.
After including the igraph library, we used the graph_from_literal function to create a
simple undirected graph with six nodes. The igraph package provides the plot.igraph
function to visualize the graphs. There are several ways in which we can create a graph. For
a complete list of the different methods available to create graphs, refer to http://igraph.
org/r/#docs.
By adding the + sign while defining the edges, we define the direction of the graph.
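A sketch of such a call, with placeholder vertex names (the book's own figure uses different labels):

> directed.graph <- graph_from_literal(A-+B, B-+C, C-+D)
> plot.igraph(directed.graph)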
You can see the arrows now showing the direction of the edges.
Let's look at the following code snippet to show the vertices and edges of the graph:
> E(simple.graph) # Edges
> V(simple.graph) # Vertices
The E() and V() functions give the edges and vertices of our graph.
By calling the name property on all the vertices, we change the names of our vertices.
Alternatively, we can also use the set_vertex_attr function to add/modify an attribute;
in this example, we added a new attribute called age.
Further, based on the age property, we change the color of the vertex:
>V(simple.graph)$color <- ifelse(V(simple.graph)$age == '11',
"blue","green")
>plot.igraph(simple.graph)
Using the color slot, we have changed the color of the vertex based on the age property.
The vertices in the preceding plot are blue where the age attribute is 11; otherwise, they are green.
Structural properties such as degree and strength help us understand the underlying
structure of the graph. Many graph-based algorithms use these properties.
Degree of a vertex
The number of edges adjacent, that is, either entering or exiting a vertex, is the degree of a
vertex. The degree function gives the degree of all the vertices in our graph.
We further use the E() function to refer to our graph's edges and add weights to the edges.
Strength of a vertex
The strength of a vertex is the sum of the weights of the edges adjacent on that node. The
strength function gives us the strength of vertices in our graph. Let's look at a small code
snippet to find the degree and strength of a graph:
> degree(simple.graph)
alice bob charlie david eli
1 1 3 1 2
> E(simple.graph)$weight <- c(10,20,35,15,25,35)
> strength(simple.graph)
alice bob charlie david eli
10 20 70 35 25
The functions degree and strength in igraph package can be invoked to get degree and
strength.
Adjacency Matrix
Any graph can be represented as a matrix, called the adjacency matrix, where the rows and
columns are the vertices of the graph. The presence of a value in a particular cell indicates
an edge between the vertices. The cells can be populated with the edge weights too.
The get.adjacency function returns the incident or the adjacency matrix of our graph.
Let's look at a small code snippet to find the adjacency matrix for a graph:
> get.adjacency(simple.graph)
5 x 5 sparse Matrix of class "dgCMatrix"
alice bob charlie david eli
alice . . . . 1
bob . . 1 . .
charlie . 1 . 1 1
david . . 1 . .
eli 1 . 1 . .
>
More networks in R
In this section, let us see some more structural properties of a graph. Till now we created
some simple graphs. It will be better to have some larger graphs to study the structural
properties.
The R code for a lot of different networks built from a diverse source of input is available
at https://github.com/igraph/igraph/tree/master/nexus/download.
Let's use the US airport network from the repository to show some structural properties of a
graph:
url <-
"http://sites.google.com/site/cxnets/US_largest500_airportnetwork.txt"
tmp <- tempdir()
dest <- paste(sep="", tmp, "/", "usairport.txt")
download.file(url, dest)
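The read call itself might look like the following sketch, assuming the downloaded file is a plain whitespace-separated edge list:

> usairport <- read_graph(dest, format = "ncol", directed = TRUE)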
We first download the file from an internet URL and use read_graph function to
successfully create a graph. Let us now start looking at some structural properties.
Centrality of a vertex
The centrality of a vertex in a graph defines the importance of that vertex in the graph.
Centrality can be calculated in a multitude of ways. Degree centrality is calculated based on
the in-degree--the number of incoming edges--and the out-degree--the number of outgoing
edges of the graph.
In the preceding case, we use the degree function to find the degree centrality score. We
have sorted it in descending order. Airport 0 has the highest centrality score. In this case,
we have used both the in-degree and the out-degree. According to this analysis, airport 0
must be some important node in this network.
Closeness is the inverse of farness. Another way of looking at it is that closeness is how long
will it take to pass a message from a node to all other nodes.
The following snippet shows the closeness calculation using the igraph package:
> head(sort(closeness(usairport), decreasing = TRUE),10)
We have nested multiple functions, let us untangle them. First, we use the function
closeness to get the closeness property for each vertex. We pass that output to sort
function, and sort the results by decreasing order of closeness. Finally, we use the head
function to pick the top 10 results from the sort output.
Having explored some of the properties of the graph, let us see an overview of some of the
graph algorithms available in the igraph package.
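For example, to get the shortest path lengths from a single vertex of our earlier toy graph (a sketch; the book's own listing is not reproduced here):

> shortest.paths(simple.graph, v = "alice")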
The shortest.paths method gives the shortest paths between our node alice and all the
other nodes. The node david is far away from alice.
We see that the algorithm moved to eli and came back to alice.
We hope this gives you an overview of how to use the igraph package. Let's now move to
our use case and see how we can leverage the igraph package to solve our problem.
The Nielsen definition of a category is based on product features. Products that exhibit the
following features are put under the same category:
When analyzing purchasing behavior, several patterns emerge: some products are sold
together at the same time, some products are sold in a time-sequential pattern, the sale of
one product affects the sale of another, and so on. These types of product interactions and
patterns can occur either in the same category or across different categories. This leads to
the formation of micro-categories. Unfortunately, there are no simple ways to identify these
micro-categories. Based on the products, market conditions, price points, consumer
preferences, and many other factors, several such micro-categories may emerge as more
retail transactions aggregate.
This data and source can be downloaded from the Packt website.
> data <- read.csv('data.csv')
> head(data)
order_id product_id
1 837080 Unsweetened Almondmilk
2 837080 Fat Free Milk
3 837080 Turkey
4 837080 Caramel Corn Rice Cakes
5 837080 Guacamole Singles
6 837080 HUMMUS 10OZ WHITE BEAN EAT WELL
The given data is in a tabular format. Every row is a tuple of order_id, representing the
transaction, and product_id, an item included in that transaction. We need to transform
this data so that we have all the pairs of products and the number of transactions in which
they have occurred together. We will leverage the arules package to achieve this. This
package provides the necessary infrastructure to store, manipulate, and analyze retail
transaction data. We have already covered the arules package in Chapter 1, Association
Rule Mining. For more information about the arules package, refer to
https://cran.r-project.org/web/packages/arules/index.html.
Data preparation
Our data preparation task involves taking the transactions data and converting it to a form
where we have product pairs and their transaction frequency. Transaction frequency is the
number of transactions in which both the products have appeared. We will use these
product pairs to build our graph. The vertices of our graph are the products. For every
product pair, an edge is drawn in the graph between the corresponding product vertices.
The weight of the edge is the transaction frequency.
We will use the arules package version 1.5-0 to help us perform this data preparation task:
> library(arules)
> transactions.obj <- read.transactions(file = 'data.csv', format = "single",
+                                       sep = ",",
+                                       cols = c("order_id", "product_id"),
+                                       rm.duplicates = FALSE,
+                                       quote = "", skip = 0,
+                                       encoding = "unknown")
Warning message:
In asMethod(object) : removing duplicated items in transactions
> transactions.obj
transactions in sparse format with
6988 transactions (rows) and
16793 items (columns)
We begin by reading our transactions stored in the text file and creating an arules data structure called transactions. Let's look at the parameters of read.transactions, the function used to create the transactions object. For the first parameter, file, we pass our text file containing the transactions from the retailer. The second parameter, format, can take either of two values, single or basket, depending on how the input data is organized. In our case, we have a tabular format with two columns--one column representing the unique identifier for our transaction and the other column for a unique identifier representing the products present in our transaction. This format is named single by arules. Refer to the arules documentation for a detailed description of all the parameters. On inspecting the newly created transactions object, transactions.obj, we see that there are 6,988 transactions and 16,793 products.
Now that we have loaded the transaction data, let's proceed to find the product pairs:
> support <- 0.015
>
> # Frequent item sets
> parameters = list(
+ support = support,
+ minlen = 2, # Minimal number of items per item set
+ maxlen = 2, # Maximal number of items per item set
+ target = "frequent itemsets"
+ )
>
> freq.items <- apriori(transactions.obj, parameter = parameters)
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen            target   ext
         NA    0.1    1 none FALSE            TRUE       5   0.015      2      2 frequent itemsets FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
"Apriori is an algorithm for frequent itemset mining and association over transaction
databases. It proceeds by identifying the frequent individual items in the database and
extending them to larger and larger item sets as long as those itemsets appear sufficiently
often in the database." -- Wikipedia
Generating frequent itemsets is the first phase of the apriori algorithm. We conveniently
leverage this phase of the algorithm to generate our product pairs and the number of
transactions in which they are present.
The apriori algorithm works in two phases. Finding the frequent item sets is the first phase of the association rule mining algorithm. A group of product IDs is called an item set. The algorithm makes multiple passes over the database; in the first pass, it finds the transaction frequency of all the individual items. These are item sets of order 1. Let's introduce the first interest measure, Support, here:
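The standard definition, which the text refers to, is the fraction of transactions that contain the item set:

Support(X) = Number of transactions containing X / Total number of transactions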
In the first pass, the algorithm calculates the transaction frequency for each product. At this stage, we have order 1 item sets. We discard all the item sets that fall below our support threshold. The assumption here is that items with a high transaction frequency are more interesting than those with a very low frequency. Items with very low support are not going to produce interesting rules further down the pipeline. Using the most frequent items, we construct item sets with two products and find their transaction frequency, that is, the number of transactions in which both items are present. Once again, we discard all the two-product item sets, also known as item sets of order 2, that fall below the given support threshold. We continue this way until we finish. Let's look at a quick illustration:
Pass 1:
Support = 0.1
Product, transaction frequency
{item5}, 0.4
{item6}, 0.3
{item9}, 0.2
{item11}, 0.1
item11 will be discarded in this phase as its transaction frequency is below the support
threshold.
Pass 2:
{item5, item6}
{item5, item9}
{item6, item9}
As you can see, we have constructed item sets of order 2 using the filtered items from pass 1. We proceed to find their transaction frequency, discard the item sets falling below our minimum support threshold, and step into pass 3, where once again we create item sets of order 3, calculate their transaction frequency, filter, and move on to pass 4. In one of the subsequent passes, we reach a stage where we cannot create any higher-order item sets. That is when we stop.
The apriori method from the arules package is used to get the frequent item sets. This method takes two parameters: a transactions object and a named list of parameters. We create a named list called parameters. Inside the named list, we have an entry for our support threshold, which we have set to 0.015. The minlen and maxlen parameters set a lower and an upper cutoff on how many items we expect in our item sets. By setting both minlen and maxlen to 2, we say that we want only product pairs.
The apriori method returns an itemset object. Let's now extract the product pairs and
their transaction frequency from the itemset object, freq.items:
> freq.items.df <- data.frame(item_set = labels(freq.items)
+ , support = freq.items@quality)
> freq.items.df$item_set <- as.character(freq.items.df$item_set)
> head(freq.items.df)
item_set support.support support.count
1 {Banana,Honeycrisp Apple} 0.01617058 113
2 {Banana,Organic Fuji Apple} 0.01817401 127
3 {Banana,Cucumber Kirby} 0.01788781 125
4 {Banana,Strawberries} 0.01931883 135
5 {Bag of Organic Bananas,Organic Zucchini} 0.01659989 116
6 {Organic Strawberries,Organic Whole Milk} 0.01617058 113
From the itemset object, freq.items, returned by apriori, we extract our product pairs and their transaction frequency count. The item_set column in our data frame, freq.items.df, refers to the product pair, the support.count column is the actual number of transactions in which both products were present, and the support.support column gives the support.count value as a fraction of the total number of transactions. Notice that our product pairs are enclosed in curly braces; we need them in two separate columns.
Let's write some cleaning code to remove those braces and also separate our product pairs
into two different columns:
> library(tidyr)
> freq.items.df <- separate(data = freq.items.df, col = item_set,
+                           into = c("item.1", "item.2"), sep = ",")
> freq.items.df[] <- lapply(freq.items.df, gsub, pattern = '\\{', replacement = '')
> freq.items.df[] <- lapply(freq.items.df, gsub, pattern = '\\}', replacement = '')
> head(freq.items.df)
                  item.1             item.2    support.support support.count
1                 Banana   Honeycrisp Apple 0.0161705781339439           113
2                 Banana Organic Fuji Apple 0.0181740125930166           127
3                 Banana     Cucumber Kirby 0.0178878076702919           125
4                 Banana       Strawberries 0.0193188322839153           135
5 Bag of Organic Bananas   Organic Zucchini 0.0165998855180309           116
6   Organic Strawberries Organic Whole Milk 0.0161705781339439           113
We leverage the separate function from the tidyr package to split the item_set column into two columns. As the products are separated by a comma, we specify the comma as the separator to the separate function. Once separated, we run a regular expression on those columns to remove the curly braces.
Let us now create a new data frame with product pairs and weights:
> network.data <- freq.items.df[,c('item.1','item.2','support.count')]
> names(network.data) <- c("from","to","weight")
> head(network.data)
from to weight
1 Banana Honeycrisp Apple 113
2 Banana Organic Fuji Apple 127
3 Banana Cucumber Kirby 125
4 Banana Strawberries 135
5 Bag of Organic Bananas Organic Zucchini 116
6 Organic Strawberries Organic Whole Milk 113
We retain only the item.1, item.2, and support.count columns. Next, we rename these columns to from, to, and weight. The igraph package expects this naming convention to create a graph object seamlessly. Finally, you can see that we have modified the data to suit the igraph package's graph manipulation functions.
We leveraged the apriori function in arules to prepare our dataset. Equipped with our
dataset, let's proceed to perform product network analysis to discover micro-categories.
The core product: In a subgraph or cluster, the product that is most commonly purchased in that group is termed the core product of the group.
The connectors: These are products that connect two subgraphs or clusters together. They are typically the products bought first when a customer starts shopping for products in that subgraph. Promoting such a product as part of cross-selling can induce customers who have never bought any products from this subgraph to start purchasing products present in it.
With our network.data data frame prepared in the exact manner we need, let's proceed to
build a graph of products:
> set.seed(29)
> my.graph <- graph_from_data_frame(network.data)
> plot.igraph(my.graph,
+ layout=layout.fruchterman.reingold,
+ vertex.label.cex=.5,
+ edge.arrow.size=.1)
>
Having built the graph, let's proceed to perform the clustering exercise, the second step in product network analysis. The clustering of graphs is performed by community detection algorithms. The communities discovered in the graph are treated as clusters. Though there are some nuances in equating communities to clusters, in this chapter we will stick to the simple principle that the communities we discover in our graph are the clusters we want to consider for our product network analysis.
Community detection algorithms: These are a group of algorithms that, when applied to a graph, produce the most likely communities present in the graph. Traditional algorithms, such as hierarchical clustering, use similarity-based approaches to arrive at the communities. More recent algorithms leverage several structural properties of the graph.
Walktrap algorithm: This algorithm is based on the random walk principle. Its fundamental assumption is that a community typically has very few edges leading outside of it, so if we perform a random walk from a vertex, we are more likely to stay within that community. We will use this algorithm to detect communities for our use case.
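The call that produces the random.cluster object inspected below is not reproduced in this excerpt. A minimal sketch, assuming the my.graph object built earlier (the groupings.df column names are assumptions):

> random.cluster <- walktrap.community(my.graph)
> groupings.df <- data.frame(products = random.cluster$names,
+                            group = random.cluster$membership)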
> random.cluster
IGRAPH clustering walktrap, groups: 2, mod: 0.26
+ groups:
  $`1`
  [1] "Banana"             "Large Lemon"        "Organic Avocado"    "Honeycrisp Apple"
  [5] "Organic Fuji Apple" "Cucumber Kirby"     "Strawberries"       "Organic Whole Milk"
  [9] "Limes"

  $`2`
  [1] "Bag of Organic Bananas" "Organic Strawberries"   "Organic Hass Avocado"   "Organic Raspberries"
We pass our graph to the walktrap.community function. We then move the cluster membership and cluster members to a data frame, groupings.df. Our random walk algorithm produced two subgraphs, or communities.
Let's start with the large cluster on the left side of the community plot. It resembles a star schema, with Banana occupying the center. Let's confirm this by looking at some properties of the Banana vertex.
A quick look at the adjacency matrix suggests that Banana is the core product for this cluster: Banana is connected to most of the other products.
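One way to check this from the console, sketched here as an assumption rather than code from the book, is to pull out Banana's row of the adjacency matrix; the non-zero entries are the products Banana points to in this directed graph:

> get.adjacency(my.graph)["Banana", ]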
Let's look at the strength and degree for the Banana vertex:
> strength(my.graph)
                Banana Bag of Organic Bananas   Organic Strawberries   Organic Hass Avocado
                  1697                    941                   1191                    789
   Organic Raspberries            Large Lemon        Organic Avocado   Organic Baby Spinach
                   414                    288                    422                    845
      Honeycrisp Apple     Organic Fuji Apple         Cucumber Kirby           Strawberries
                   113                    127                    125                    135
      Organic Zucchini     Organic Whole Milk                  Limes
                   116                    233                    270
> degree(my.graph)
Banana has a very high degree and strength, which gives us more evidence to assign Banana as the core product for this cluster. Though it was already evident from the graph plot that Banana is the most important product in the cluster, looking at these properties of the Banana vertex gives us more confidence.
Having decided upon Banana as the core product for the left cluster, let's do some random walking from the Banana vertex. At each step, we select an edge uniformly at random from the outgoing edges of the current vertex and walk a specified number of steps. To find out more about random walks, see http://igraph.org/r/doc/random_walk.html.
> for(var in seq(1,5)){
+ print (random_walk(my.graph, start = "Banana", steps = 2))
+ }
+ 2/15 vertices, named, from 16e82b7:
[1] Banana Organic Strawberries
+ 2/15 vertices, named, from 16e82b7:
[1] Banana Organic Avocado
+ 2/15 vertices, named, from 16e82b7:
[1] Banana Cucumber Kirby
+ 2/15 vertices, named, from 16e82b7:
[1] Banana Limes
+ 2/15 vertices, named, from 16e82b7:
[1] Banana Strawberries
> random_walk(my.graph, start = "Banana", steps = 3)
+ 3/15 vertices, named, from 16e82b7:
[1] Banana Organic Baby Spinach Organic Hass Avocado
> random_walk(my.graph, start = "Banana", steps = 4)
+ 4/15 vertices, named, from 16e82b7:
[1] Banana               Organic Avocado      Organic Baby Spinach Organic Strawberries
> random_walk(my.graph, start = "Banana", steps = 5)
This random walk can expose some of the product affinities quickly, without a lot of computation. Imagine working on a large graph with thousands of vertices; such random walks, performed multiple times, can quickly unearth product affinities.
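One way to turn this idea into a quick summary, sketched here as an assumption rather than code from the book, is to repeat short walks from the core product and tabulate how often each product is visited:

> walks <- replicate(1000, as_ids(random_walk(my.graph, start = "Banana", steps = 3)))
> # Drop the starting vertex and count how often each other product is reached
> head(sort(table(walks[walks != "Banana"]), decreasing = TRUE), 5)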
Another data source that can be used to experiment with the techniques described in this chapter is Amazon's product co-purchasing network, which is available for download at https://snap.stanford.edu/data/amazon0302.html.
The Organic Avocado and Banana products in the left cluster can be termed the connectors. They have a path into the right-side cluster. Promoting these products can induce the customers of the left cluster to start exploring or buying the products in the right cluster.
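Here the connectors were spotted by inspecting the plot. As a hedged alternative, not taken from the text, vertices with high betweenness tend to sit on the bridges between clusters, so they are natural connector candidates:

> # weights = NA ignores edge weights so that betweenness counts hops rather than distances
> head(sort(betweenness(my.graph, directed = FALSE, weights = NA), decreasing = TRUE), 5)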
This brings us to the end of this section. Here, we were able to use a random walk-based community detection algorithm to find communities in our graph. These communities can become the new micro-categories. Further, we also identified our core and connector vertices.
We will now wrap this analysis in an RShiny application. The following code defines the user interface:
ui <- fluidPage(
  navbarPage("Product Pairs",
             tabPanel("Transactions",
                      fileInput("datafile", "select transactions csv file",
                                accept = c("text/csv",
                                           "text/comma-separated-values,text/plain",
                                           ".csv")),
                      dataTableOutput("transactions")),
             tabPanel("Product Pairs",
                      dataTableOutput("ppairs")),
             tabPanel("Community",
                      plotOutput("community"))
  )
)
We have three panels. In the first panel, we select a product transaction file and display it.
In our second panel, we show the product pairs and their transaction counts. In the final
panel, we display the communities we have discovered.
As soon as a file is uploaded, this file is read into a data frame, trans.df.
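The server-side handler is not reproduced in this excerpt; a minimal sketch of how the upload might be read (the reactive wrapper shown here is an assumption):

trans.df <- reactive({
  req(input$datafile)                  # wait until a file has been selected
  read.csv(input$datafile$datapath,    # read the uploaded transactions file
           stringsAsFactors = FALSE)
})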
Using the transactions object, the apriori function is invoked to get the product pairs. The output of apriori is carefully formatted, and a final data frame, network.data, is created.
The rest of the functions in the server render these outputs to the respective slots in the UI.
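A sketch of what those rendering functions might look like, assuming the intermediate results are exposed as reactives named as in the chapter:

output$transactions <- renderDataTable({ trans.df() })
output$ppairs       <- renderDataTable({ network.data() })
output$community    <- renderPlot({
  plot(random.cluster(), my.graph(),
       vertex.label.cex = .5, edge.arrow.size = .1)
})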
Using the file selector, we can select the transaction file. The transaction file should be a .csv file with two columns, one for the order_id and the other for the product_id. In this version, the column names must be kept as order_id and product_id.
## Graph properties
E(simple.graph) # Edges
V(simple.graph) # Vertices
## Graph attributes
V(simple.graph)$name <- c('alice', 'bob', 'charlie', 'david', 'eli', 'francis')
simple.graph <- set_vertex_attr(simple.graph, "age",
                                value = c('11', '11', '15', '9', '8', '11'))
plot.igraph(simple.graph)
V(simple.graph)$color <- ifelse(V(simple.graph)$age == '11', "blue", "green")
plot.igraph(simple.graph)
# Structural properties
degree(simple.graph) # degree of nodes
E(simple.graph)$weight <- c(10,20,35,15,25,35)
strength(simple.graph) # strength of nodes
get.adjacency(simple.graph) # adjacency matrix
library(arules)
transactions.obj <- read.transactions(file = 'data.csv', format = "single",
                                      sep = ",", cols = c("order_id", "product_id"),
                                      rm.duplicates = FALSE, quote = "", skip = 0,
                                      encoding = "unknown")
transactions.obj
# Interest Measures
support <- 0.015
## Clustering
random.cluster <- walktrap.community(my.graph)
random.cluster
plot(random.cluster,my.graph,
layout=layout.fruchterman.reingold,
vertex.label.cex=.5,
edge.arrow.size=.1)
get.adjacency(my.graph)
strength(my.graph)
degree(my.graph)
## Random walks
for(var in seq(1,5)){
print (random_walk(my.graph, start = "Banana", steps = 2))
}
library(shiny)
library(arules)
library(igraph, quietly = TRUE)
ui <- fluidPage(
  navbarPage("Product Pairs",
             tabPanel("Transactions",
                      fileInput("datafile", "select transactions csv file",
                                accept = c("text/csv",
                                           "text/comma-separated-values,text/plain",
                                           ".csv")),
                      dataTableOutput("transactions")),
             tabPanel("Product Pairs",
                      dataTableOutput("ppairs")),
             tabPanel("Community",
                      plotOutput("community"))
  )
)
Summary
We started this chapter by explaining our problem and dataset, and we carried out product categorization using network/graph theory. We introduced the R package igraph to create and manipulate graphs, explored some properties of graphs, and learned a little about random walks. We used the arules package to find frequent pairs of items and finally applied a graph clustering algorithm to our data to discover product categories. We were also able to identify important vertices playing the roles of core products and connectors.