R For Data Science
My first impression of R was that it’s just software for statistical computing.
Thankfully, I was wrong! R has enough provisions to implement machine
learning algorithms in a fast and simple manner.
This is a complete tutorial on learning data science and machine learning with
R. By the end of this tutorial, you will have good exposure to building
predictive models using machine learning on your own.
1. Basics of R Programming
Why learn R?
I don’t know if I have a solid reason to convince you, but let me share what got
me started. I had no prior coding experience; in fact, I never studied computer
science. I came to know that to learn data science, one must learn
either R or Python as a starter. I chose the former. Here are some benefits I
found after using R:
You can download and install base R on its own, but I’d urge you to start
with RStudio, which provides a much better coding experience. For Windows users,
RStudio is available for Windows Vista and above. Follow the steps
below to install RStudio:
1. Go to https://www.rstudio.com/products/rstudio/download/
2. In the ‘Installers for Supported Platforms’ section, choose and click the
RStudio installer for your operating system. The download should
begin as soon as you click.
3. Click Next..Next..Finish.
4. Download complete.
5. To start RStudio, click on its desktop icon or use ‘search windows’ to
access the program. It looks like this:
Let’s quickly understand the interface of RStudio:
1. R Console: This area shows the output of the code you run. You can also
write code directly in the console, but code entered there cannot
be traced later. This is where the R script comes in.
2. R Script: As the name suggests, this is where you write your code. To run
it, simply select the line(s) of code and press Ctrl + Enter.
Alternatively, you can click the little ‘Run’ button located at the top right
corner of the R script.
3. R Environment: This space displays the set of external elements added:
data sets, variables, vectors, functions etc. To check whether data
has been loaded properly in R, always look at this area.
4. Graphical Output: This space displays the graphs created during
exploratory data analysis. Not just graphs; you can also select packages and
seek help from R’s embedded official documentation.
The sheer power of R lies in its incredible packages. In R, most data handling
tasks can be performed in two ways: using R packages or base R functions. In
this tutorial, I’ll also introduce you to the most handy and powerful R
packages. To install a package, simply type:
install.packages("package name")
As a first-time user, a pop-up might appear asking you to select your CRAN mirror
(country server); choose accordingly and press OK.
Note: You can type this either directly in the console and press ‘Enter’, or in
the R script and click ‘Run’.
Basic Computations in R
Let’s begin with the basics. To get familiar with the R coding environment, start with
some basic calculations. The R console can be used as an interactive calculator too.
Type the following in your console:
> 2 + 3
[1] 5
> 6 / 3
[1] 2
> (3*8)/(2*3)
[1] 4
> log(12)
[1] 2.484907
Similarly, you can experiment with various combinations of calculations and get the
results. In case you want to recall a previous calculation, this can be done in
two ways. First, click in the R console and press the ‘Up / Down Arrow’ keys on your
keyboard. This will cycle through the previously executed commands; press Enter to run one.
But what if you have done too many calculations? It would be too painful to
scroll through every command to find it. In such situations, creating a
variable is a helpful way.
In R, you can create a variable using the <- or = sign. Let’s say I want to create a
variable x to compute the sum of 7 and 8. I’ll write it as:
> x <- 8 + 7
> x
[1] 15
Once we create a variable, you no longer get the output directly (as with a
calculator), unless you call the variable on the next line. Remember, variable
names can contain letters and numbers, but they cannot start with a number; you
can’t create purely numeric variable names.
2. Essentials of R Programming
Understand and practice this section thoroughly. This is the building block of
your R programming knowledge. If you get this right, you would face less
trouble in debugging.
R has five basic (atomic) classes of objects:
1. Character
2. Numeric (real numbers)
3. Integer (whole numbers)
4. Complex
5. Logical (True / False)
Let’s understand the concept of objects and attributes practically. The most basic
object in R is known as a vector. You can create an empty vector using vector().
Remember, a vector contains objects of the same class.
For example, let’s create vectors of different classes. We can also create a vector
using the c() (concatenate) command.
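The example vectors themselves aren’t shown in the text; a minimal sketch (the names and values are illustrative):
> a <- c(1.8, 4.5)                      #numeric
> b <- c(1 + 2i, 3 - 6i)                #complex
> d <- c(23L, 44L)                      #integer (the L suffix makes them integers)
> e <- vector("logical", length = 5)    #logical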
Data Types in R
R has various data types, which include vectors (numeric, integer, etc.),
matrices, data frames and lists. Let’s understand them one by one.
Vector: As mentioned above, a vector contains objects of the same class. But you
can mix objects of different classes too. When objects of different classes are
mixed in a vector, coercion occurs. This causes the objects
to be ‘converted’ into a single common class. For example:
> qt <- c("Time", 24, "October", TRUE, 3.33) #character
> ab <- c(TRUE, 24) #numeric
> cd <- c(2.5, "May") #character
> class(qt)
"character"
Similarly, you can change the class of any vector. But you should pay attention
here: if you try to convert a “character” vector to “numeric”, NAs will be
introduced for values that can’t be parsed as numbers. Hence, you should be careful when using this command.
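A quick sketch of what that coercion looks like, using the qt vector created above:
> as.numeric(qt)
[1]    NA 24.00    NA    NA  3.33
Warning message:
NAs introduced by coercion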
List: A list is a special type of vector which can contain elements of different data
types. For example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22
[[2]]
[1] "ab"
[[3]]
[1] TRUE
[[4]]
[1] 1+2i
As you can see, the output of a list is different from that of a vector, because its
objects are of different types. The double bracket [[1]] shows the index of the
first element, and so on. Hence, you can easily extract elements of a list
by their index. Like this:
> my_list[[3]]
> [1] TRUE
You can use single brackets [] too. But that returns a sub-list (a list containing the
selected element) rather than the element itself. Like this:
> my_list[3]
> [[1]]
[1] TRUE
Matrices: When a vector is given rows and columns, i.e. a dimension
attribute, it becomes a matrix. A matrix is represented by a set of rows and
columns; it is a 2-dimensional data structure and consists of elements of the same
class. Let’s create a matrix of 3 rows and 2 columns:
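The creation step isn’t shown in the text; a minimal sketch that produces a 3 x 2 matrix named my_matrix (the values are illustrative):
> my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6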
> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2
As an interesting fact, you can also create a matrix from a vector. All you need
to do is assign a dimension with dim() afterwards. Like this:
> age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16
> dim(age) <- c(3, 2)
> age
     [,1] [,2]
[1,]   23   12
[2,]   44   31
[3,]   15   16
> class(age)
[1] "matrix"
You can also join two vectors using the cbind() and rbind() functions. But make
sure that both vectors have the same number of elements; if not, R will recycle the
shorter vector (with a warning) or throw an error.
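A quick sketch of joining two equal-length vectors (the names and values are illustrative):
> height <- c(160, 171, 152)
> weight <- c(55, 70, 48)
> cbind(height, weight)
     height weight
[1,]    160     55
[2,]    171     70
[3,]    152     48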
Data Frame: This is the most commonly used member of the data types family. It
is used to store tabular data. It is different from a matrix: in a matrix, every
element must have the same class, but in a data frame you can put vectors (columns)
of different classes. This means every column of a data frame acts like
a list. Every time you read data into R, it will be stored in the form of a data
frame. Hence, it is important to understand the most commonly used commands on a data
frame:
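The data frame used below isn’t created in the text; a minimal sketch consistent with the output that follows (stringsAsFactors = TRUE reproduces the factor column on R 4.0 and later):
> df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                   score = c(67, 56, 87, 91), stringsAsFactors = TRUE)
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91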
> dim(df)
[1] 4 2
> str(df)
'data.frame': 4 obs. of 2 variables:
$ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
$ score: num 67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2
Let’s understand the code above. df is the name of the data frame. dim() returns the
dimensions of the data frame: 4 rows and 2 columns. str() returns the structure of the
data frame, i.e. the list of variables stored in it. nrow() and ncol() return the
number of rows and number of columns in a data set respectively.
Here you see that “name” is a factor variable and “score” is numeric. In data
science, variables can be categorized into two types: continuous and
categorical.
Continuous variables can take any numeric value, such as 1, 2, 3.5, 4.66
etc. Categorical variables take only a limited set of distinct values (levels),
such as “Tier 1” / “Tier 2” / “Tier 3”. In R, categorical variables are represented
by factors. In df, name is a factor variable having 4 unique levels. Factor
(categorical) variables get special treatment in a data set. Let’s now understand
the concept of missing values in R. This is one of the most painful yet crucial
parts of predictive modeling. You must be aware of all techniques to deal with them.
Missing values in R are represented by NA and NaN. Now we’ll check whether a data set
has missing values (using the same data frame df).
> df[1:2,2] <- NA #injecting NA at 1st, 2nd row and 2nd column of df
> df
name score
1 ash NA
2 jane NA
3 paul 87
4 mark 91
> is.na(df) #checks the entire data set for NAs and return logical output
name score
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE
> table(is.na(df)) #returns a table of logical output
FALSE TRUE
6 2
Missing values hinder normal calculations in a data set. For example, let’s say,
we want to compute the mean of score. Since there are two missing values, it
can’t be done directly. Let’s see:
> mean(df$score)
[1] NA
> mean(df$score, na.rm = TRUE)
[1] 89
The na.rm = TRUE parameter tells R to ignore the NAs and compute the
mean of the remaining values in the selected column (score). To remove rows with
NA values from a data frame, you can use na.omit:
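The na.omit call itself isn’t shown; a minimal sketch on the same df:
> new_df <- na.omit(df)
> new_df
  name score
3 paul    87
4 mark    91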
Control Structures in R
As the name suggests, a control structure ‘controls’ the flow of code / commands
written inside a function. A function is a set of multiple commands written to
automate a repetitive coding task.
For example: you have 10 data sets and you want to find the mean of the ‘Age’
column present in every data set. This can be done in two ways: either you write
the code to compute the mean 10 times, or you simply create a function and pass each
data set to it.
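A minimal sketch of such a function (the column name ‘Age’ comes from the example above; the data set name my_data is illustrative):
#a function that returns the mean Age of whatever data set is passed to it
mean_age <- function(dataset){
   mean(dataset$Age, na.rm = TRUE)
}
#usage
mean_age(my_data)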
if, else – This structure is used to test a condition. Below is the syntax:
if (<condition>){
##do something
} else {
##do something else
}
Example
#initialize a variable
N <- 10
#check whether N multiplied by 5 is greater than 40
if (N * 5 > 40){
print("This is easy!")
} else {
print("It's not easy!")
}
for – This structure repeats a block of code a fixed number of times. Example
#initialize a vector
y <- c(99,45,34,65,76,23)
#print the first 4 elements of this vector
for (i in 1:4){
print(y[i])
}
while – This structure repeats a block of code as long as a condition holds. Example
#initialize a condition
Age <- 12
#increment Age until it reaches 17
while (Age < 17){
print(Age)
Age <- Age + 1
}
There are other control structures as well, but they are less frequently used than
those explained above: repeat, break, next and return.
Note: If you find the ‘control structures’ section difficult to understand, don’t
worry. R is supported by various packages that complement the work done by
control structures.
Useful R Packages
Out of the ~7800 packages listed on CRAN, I’ve listed some of the most powerful
and commonly used packages for predictive modeling in this article. Since I’ve
already explained the method of installing packages, you can go ahead and
install them now; sooner or later you’ll need them.
Importing Data: R offers a wide range of packages for importing data available
in any format such as .txt, .csv, .json, .sql etc. To import large files of data
quickly, it is advisable to install and
use data.table, readr, RMySQL, sqldf and jsonlite.
Data Visualization: R has built-in plotting commands as well. They are good for
creating simple graphs, but things become cumbersome when it comes to
advanced graphics. Hence, you should install ggplot2.
3. Exploratory Data Analysis in R
From this section onwards, we’ll dive deep into the various stages of predictive
modeling. Hence, make sure you understand every aspect of this section. In case
you find anything difficult to understand, ask me in the comments section
below.
Data exploration is a crucial stage of predictive modeling. You can’t build great
and practical models unless you learn to explore the data from beginning to end.
This stage forms a concrete foundation for data manipulation (the very next
stage). Let’s understand it in R.
Test Data: Once the model is built, its accuracy is ‘tested’ on the test data. This
data always contains fewer observations than the train data set. Also, it
does not include the ‘response variable’.
Right now, you should download the data set. Take a good look at the train and test
data, cross-check the information shared above, and then proceed.
As a beginner, I’ll advise you to keep the train and test files in your working
directory to avoid unnecessary directory troubles. Once the directory is set, we
can easily import the .csv files using the commands below.
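Setting the working directory first (the path below is just a placeholder for wherever you saved the files):
#point R at the folder containing the downloaded files (hypothetical path)
> setwd("C:/Users/yourname/Documents/bigmart")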
#Load Datasets
train <- read.csv("Train_UWu5bXk.csv")
test <- read.csv("Test_u94Q5KV.csv")
In fact, even prior to loading data into R, it’s good practice to look at the data in
Excel. This helps in strategizing the complete predictive modeling process. To
check whether the data set has been loaded successfully, look at the R environment
pane; the data can be seen there. Let’s explore the data quickly.
> dim(train)
[1] 8523 12
> dim(test)
[1] 5681 11
We have 8523 rows and 12 columns in the train data set, and 5681 rows and 11
columns in the test data set. This makes sense: test data should always have one
column less (mentioned above, right?). Let’s dig deeper into the train data set now.
#check the variables and their types in train
> str(train)
'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122
1298 759 697 739 441 991 ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6
...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4
2 6 8 3 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985
2002 2007 ...
$ Outlet_Size : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3
2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
To begin with, I’ll first check if this data has missing values. This can be done
by using:
> table(is.na(train))
FALSE TRUE
100813 1463
In the train data set, we have 1463 missing values. Let’s check the variables in
which these values occur. It’s important to find and locate these missing
values. Many data scientists have repeatedly advised beginners to pay close
attention to missing values during the data exploration stage.
> colSums(is.na(train))
Item_Identifier Item_Weight
0 1463
Item_Fat_Content Item_Visibility
0 0
Item_Type Item_MRP
0 0
Outlet_Identifier Outlet_Establishment_Year
0 0
Outlet_Size Outlet_Location_Type
0 0
Outlet_Type Item_Outlet_Sales
0 0
Hence, we see that column Item_Weight has 1463 missing values. Let’s
get more inferences from this data.
> summary(train)
Here are some quick inferences drawn from the variables in the train data set:
1. Item_Fat_Content has mis-matched factor levels (5 levels such as “LF” and
“low fat” that represent only two real categories).
2. The minimum value of Item_Visibility is 0, which is practically impossible,
since every item on a shelf has some visibility.
3. Outlet_Size has an unlabeled (blank) level.
4. Item_Weight has 1463 missing values (seen earlier).
I’m sure you would understand these variables better when explained visually.
Using graphs, we can analyze the data in two ways: univariate analysis and
bivariate analysis.
Univariate analysis is done with one variable; bivariate analysis is done with
two variables. Univariate analysis is a lot easier to do, so I’ll skip that part
here; I’d recommend you try it at your end. Let’s now experiment with
bivariate analysis and carve out hidden insights.
For visualization, I’ll use the ggplot2 package. These graphs will help us
understand the distribution and frequency of variables in the data set.
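The plotting code for the graph discussed below isn’t included in the text; a minimal ggplot2 sketch of total sales by item type (the aesthetic choices are illustrative):
> library(ggplot2)
> ggplot(train, aes(x = Item_Type, y = Item_Outlet_Sales)) +
     geom_bar(stat = "identity", fill = "purple") +
     theme(axis.text.x = element_text(angle = 70, vjust = 0.5)) +
     xlab("Item Type") + ylab("Item Outlet Sales") + ggtitle("Item Type vs Sales")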
From this graph, we can infer that Fruits and Vegetables contribute the
highest amount of outlet sales, followed by Snack Foods and Household products.
This information can also be represented using a box plot. The benefit of
using a box plot is that you get to see the outliers and the spread of the
corresponding levels of a variable (shown below).
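The box plot code isn’t shown either; a minimal sketch, assuming the same train data (fill and transparency are illustrative):
> ggplot(train, aes(x = Item_Type, y = Item_Outlet_Sales)) +
     geom_boxplot(fill = "red", alpha = 0.5) +
     theme(axis.text.x = element_text(angle = 70, vjust = 0.5)) +
     xlab("Item Type") + ylab("Item Outlet Sales")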
> dim(train)
[1] 8523 12
> dim(test)
[1] 5681 11
The test data set has one column less (the response variable). Let’s first add that
column. We can give this column any value; an intuitive approach would be to
extract the mean value of sales from the train data set and use it as a placeholder
for the test variable Item_Outlet_Sales. Anyway, let’s keep it simple for now: I’ve
taken the value 1. Now, we’ll combine the data sets.
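The combining and imputation commands aren’t shown in the text; a minimal sketch consistent with the check that follows (median imputation for Item_Weight is one simple option):
#add the missing response column to test with a placeholder value of 1
> test$Item_Outlet_Sales <- 1
#combine train and test for joint exploration and feature engineering
> combi <- rbind(train, test)
#impute missing Item_Weight values with the column median
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
#verify no missing values remain in Item_Weight
> table(is.na(combi$Item_Weight))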
FALSE
14204
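The level-cleaning commands referenced below aren’t shown either; a minimal base-R sketch (the exact levels to map depend on what levels(combi$Item_Fat_Content) shows in your session):
#give the blank level of Outlet_Size a proper name
> levels(combi$Outlet_Size)[1] <- "Other"
#standardize the mismatched levels of Item_Fat_Content
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content,
       c("LF" = "Low Fat", "low fat" = "Low Fat"))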
Using the commands above, I’ve assigned the name ‘Other’ to the unnamed level
of the Outlet_Size variable. For the rest, I’ve simply renamed the mismatched levels
of Item_Fat_Content.
4. Data Manipulation in R
Let’s call this the advanced level of data exploration. In this section we’ll
learn about feature engineering and other useful aspects in a practical way.
Feature Engineering: This component separates an intelligent data scientist
from a merely technically enabled one. You might have access to large
machines to run heavy computations and algorithms, but the power delivered by
new features just can’t be matched. We create new variables to extract and
provide as much ‘new’ information to the model as possible, to help it make accurate
predictions.
If you have been thinking all this time, great. But now is the time to think
deeper. Look at the data set and ask yourself: what else could influence
Item_Outlet_Sales? Anyhow, the answer is below, but I want you to try it out
first, before scrolling down.
1. Count of Outlet Identifiers – There are 10 unique outlets in this data. This
variable gives us the count of records for each outlet in the data set. The more
records an outlet has, the more sales it is likely to contribute.
> library(dplyr)
> a <- combi%>%
group_by(Outlet_Identifier)%>%
tally()
> head(a)
Source: local data frame [6 x 2]

  Outlet_Identifier     n
             (fctr) (int)
1            OUT010   925
2            OUT013  1553
3            OUT017  1543
4            OUT018  1546
5            OUT019   880
6            OUT027  1559
As you can see, the dplyr package makes data manipulation quite effortless; you no
longer need to write long functions. In the code above, I’ve simply stored the
counts in a new data frame a. Next, the new column Outlet_Count is added to
our original ‘combi’ data set.
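The joining and year-calculation steps aren’t shown in the text; a minimal sketch (the column names Outlet_Count and Year are the ones referenced here):
#rename the count column and merge it back into combi
> names(a)[2] <- "Outlet_Count"
> combi <- full_join(a, combi, by = "Outlet_Identifier")
#outlet age: the data is from 2013, so subtract the establishment year
> combi$Year <- 2013 - combi$Outlet_Establishment_Year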
This suggests that outlets established in 1999 were 14 years old in 2013 and so
on.
4. Item Type New – Now, pay attention to the Item_Identifiers. We are about to
discover a new trend. Look carefully: there is a pattern in the identifiers, which
start with “FD”, “DR” or “NC”. Now check the Item_Types corresponding to these
identifiers in the data set. You’ll discover that items starting with “FD” are
mostly food items, items starting with “DR” are drinks, and items starting with
“NC” are products which can’t be consumed; let’s call them
non-consumable. Let’s capture this pattern in a new variable.
Here I’ll use the substr() and gsub() functions to extract and rename these
categories respectively.
Let’s now add this information to our data set as a variable named
‘Item_Type_New’.
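The extraction code isn’t shown in the text; a minimal sketch, assuming the identifier pattern described above:
#take the first two characters of each identifier: "FD", "DR" or "NC"
> combi$Item_Type_New <- substr(combi$Item_Identifier, 1, 2)
#rename the codes to readable categories
> combi$Item_Type_New <- gsub("FD", "Food", combi$Item_Type_New)
> combi$Item_Type_New <- gsub("DR", "Drinks", combi$Item_Type_New)
> combi$Item_Type_New <- gsub("NC", "Non-Consumable", combi$Item_Type_New)
> table(combi$Item_Type_New)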
I’ll leave the rest of the feature engineering intuition to you. You can think of more
variables which could add information to the model. But make sure the new variables
aren’t highly correlated with each other; since they emanate from the same set of
variables, there is a high chance of correlation. You can check this in R
using the cor() function.
Label Encoding and One Hot Encoding
Just one last aspect of feature engineering is left: label encoding and one hot
encoding.
One hot encoding, in simple words, is splitting a categorical variable into
its unique levels and eventually removing the original variable from the data set.
Confused? Here’s an example. Take any categorical variable, say
Outlet_Location_Type. It has 3 levels. One hot encoding this variable will create 3
new variables consisting of 1s and 0s: a 1 represents the presence of that level
in a row, and a 0 its absence. Let’s look at a sample:
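The sample table isn’t reproduced in the text; an illustrative sketch of what one hot encoding Outlet_Location_Type produces (the rows are hypothetical):
Outlet_Location_Type   Tier_1   Tier_2   Tier_3
Tier 1                      1        0        0
Tier 3                      0        0        1
Tier 2                      0        1        0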
This was the demonstration of one hot encoding. Hope you have understood the
concept now. Let’s now apply this technique to all categorical variables in our
data set (excluding ID variable).
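One common way to do this in base R (not the approach used in the text) is model.matrix(); a minimal sketch for a single variable:
#one hot encode one categorical variable with base R (illustrative)
> sample_ohe <- model.matrix(~ Outlet_Location_Type - 1, data = combi)
> head(sample_ohe)
The dummies-based approach applied to the whole data set is shown next.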
>library(dummies)
>combi <- dummy.data.frame(combi, names =
c('Outlet_Size','Outlet_Location_Type','Outlet_Type', 'Item_Type_New'),
sep='_')
With this, I have shared two different methods of performing one hot encoding in
R. Let’s check whether the encoding has been done.
As you can see, after one hot encoding, the original variables are removed
automatically from the data set.
Finally, we’ll drop the columns which have either been converted into other
variables or are identifier variables. This can be accomplished using select from
the dplyr package.
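The drop step isn’t shown; a minimal sketch (the exact columns to drop depend on what you converted earlier, so this list is illustrative):
> library(dplyr)
> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier,
                            Item_Type, Outlet_Establishment_Year))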
5. Predictive Modeling in R
In this section, I’ll cover regression, decision trees and random forest. A
detailed explanation of these algorithms is outside the scope of this article;
they have been satisfactorily explained in our previous
articles, and I’ve provided links to useful resources.
As you can see, we have encoded all our categorical variables. Now this data
set is ready to be taken forward to modeling. Since we started from separate train
and test sets, let’s now divide the data back.
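The split itself isn’t shown; a minimal sketch, assuming combi stacks the train rows first and the test rows after (as in the rbind above):
#divide combi back into the modeling sets
> new_train <- combi[1:nrow(train), ]
> new_test <- combi[-(1:nrow(train)), ]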
Let’s now build our first regression model on this data set. R uses the lm() function
for regression.
In our case, I found that our new variables aren’t helping much, i.e. Item count,
Outlet count and Item_Type_New; none of these variables is significant.
Significant variables are denoted by a ‘*’ sign in the summary output.
> cor(new_train)
Alternatively, you can also use corrplot package for some fancy correlation
plots. Scrolling through the long list of correlation coefficients, I could find a
deadly correlation coefficient:
#load directory
> path <- "C:/Users/manish/desktop/Data/February 2016"
> setwd(path)
#load data
> train <- read.csv("train_Big.csv")
> test <- read.csv("test_Big.csv")
#replace 0 values in Item_Visibility with the median
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
median(combi$Item_Visibility), combi$Item_Visibility)
#linear regression
> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)
Now we have got R² = 0.5623. This teaches us that sometimes all you need is a
simple thought process to improve accuracy. It’s quite a good improvement over the
previous model. Next time you work on any model, always remember to
start with a simple one.
Let’s check out regression plot to find out more ways to improve this model.
> par(mfrow=c(2,2))
> plot(linear_model)
You can zoom into these graphs in RStudio at your end. All these plots tell a
different story, but the most important one is told by the
Residuals vs Fitted graph.
Residual values are the differences between actual and predicted outcome values;
fitted values are the predicted values. If you look carefully, you’ll notice a
funnel shape in this graph (from right to left). This shape suggests that
our model suffers from heteroskedasticity (unequal variance in the error terms).
Had the variance been constant, there would be no pattern visible in this
graph.
This model can be further improved by detecting outliers and high leverage
points. For now, I leave that part to you! I shall write a separate post on
mysteries of regression soon. For now, let’s check our RMSE so that we can
compare it with other algorithms demonstrated below.
> install.packages("Metrics")
> library(Metrics)
> rmse(new_train$Item_Outlet_Sales, exp(linear_model$fitted.values))
[1] 1140.004
Let’s proceed to decision tree algorithm and try to improve our RMSE score.
Decision Trees
Before you start, I’d recommend you glance through the basics of decision
tree algorithms. In R, the decision tree algorithm can be implemented using the rpart
package. In addition, we’ll use the caret package for cross validation. Cross
validation is a technique to build robust models which are not prone
to overfitting. In R, a decision tree uses a complexity parameter (cp). It measures
the trade-off between model complexity and accuracy on the training set. A smaller
cp leads to a bigger tree, which might overfit the model. Conversely, a large
cp value might underfit the model. Underfitting occurs when the model does not
capture the underlying trends properly. Let’s find the optimum cp value for our
model with 5-fold cross validation.
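The cross-validation setup used below isn’t defined in the text; a minimal sketch of the two objects the train() call expects (the cp grid is illustrative):
#load the required packages
> library(rpart)
> library(caret)
#5-fold cross validation
> fitControl <- trainControl(method = "cv", number = 5)
#grid of cp values to evaluate
> cartGrid <- expand.grid(cp = (1:50) * 0.01)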
#decision tree
> tree_model <- train(Item_Outlet_Sales ~ ., data = new_train, method =
"rpart", trControl = fitControl, tuneGrid = cartGrid)
> print(tree_model)
The final value is cp = 0.01. You can also check the table printed in the console
for more information; the model with cp = 0.01 has the least RMSE. Let’s now
build a decision tree with 0.01 as the complexity parameter.
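The fit itself isn’t shown; a minimal sketch using rpart directly with the chosen cp, plus the RMSE check quoted below (the object names are illustrative):
#decision tree with the selected complexity parameter
> main_tree <- rpart(Item_Outlet_Sales ~ ., data = new_train,
                     control = rpart.control(cp = 0.01))
#predictions on the training data and their RMSE
> pre_score <- predict(main_tree, type = "vector")
> rmse(new_train$Item_Outlet_Sales, pre_score)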
As you can see, our RMSE has improved from 1140 to 1102.77 with the
decision tree. To improve this score further, you can tune the
parameters for greater accuracy.
Random Forest
Let’s do it!
If you notice, you’ll see I’ve used method = “parRF”. This is parallel random
forest, a parallel implementation of the random forest algorithm. It lets
your local machine take less time over the random forest computation.
Alternatively, you can use method = “rf” for the standard random forest
function.
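The tuning code referenced here isn’t shown; a minimal caret sketch, assuming the same fitControl object as before (the mtry grid is illustrative):
#tune mtry with 5-fold cross validation using parallel random forest
> forest_grid <- expand.grid(mtry = c(5, 10, 15, 20))
> forest_model <- train(Item_Outlet_Sales ~ ., data = new_train,
                        method = "parRF", trControl = fitControl,
                        tuneGrid = forest_grid)
> print(forest_model)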
Now we’ve got the optimal value of mtry = 15. Let’s use 1000 trees for
computation.
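The final forest isn’t shown either; a minimal sketch with randomForest directly, using the tuned mtry (object names are illustrative):
#random forest with 1000 trees and the tuned mtry
> library(randomForest)
> main_forest <- randomForest(Item_Outlet_Sales ~ ., data = new_train,
                              mtry = 15, ntree = 1000)
#training RMSE for comparison with the earlier models
> rmse(new_train$Item_Outlet_Sales, predict(main_forest, new_train))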
This model can be further improved by tuning its parameters. Also, let’s make our
first submission with our best RMSE score, from the decision tree.
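The submission step isn’t shown; a minimal sketch, assuming the decision tree model (main_tree) and the new_test split from earlier (the output file name is illustrative):
#predict on the test portion and write a submission file
> main_predict <- predict(main_tree, newdata = new_test, type = "vector")
> sub_file <- data.frame(Item_Identifier = test$Item_Identifier,
                         Outlet_Identifier = test$Outlet_Identifier,
                         Item_Outlet_Sales = main_predict)
> write.csv(sub_file, "submission.csv", row.names = FALSE)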
When predicted on out of sample data, our RMSE has come out to be 1174.33.