Getting Started With R

Download
1. To download R, visit the official R site: http://www.r-project.org/.
2. Click the "CRAN" link on the left-hand side of the page.
3. Choose your country and click the link available for your location.
4. Click "Download R for Windows" (for Windows).
5. Click "base".
[Image: R Screen]
Basics of R Programming
1. Write your first equation in R
Enter your equation in the command window after the ">" prompt (without quotes)
and press ENTER to see the result. The [1] tells you that the line of output
begins with the first element of the result.
2. The # character at the beginning of a line signifies a comment.
3. The operator "<-" (without quotes) is equivalent to the "=" sign for assignment. You can use either of
the operators.
R syntax editor
Go to File >> New Script. In the new R Editor window, write your
code, highlight it, and press F5 to run it.
4. The getwd() function shows the working directory
5. R uses forward slashes instead of backward slashes in filenames.
6. The setwd() function tells R where you would like your files to save
(changes the working directory).
setwd ("C:/Users/Deepanshu/Downloads")
Notice the forward slash is used in the filename above.
7. The c function is widely used to combine values to form a vector.
8. Press CTRL + ENTER to enter a continued line. The "+" prompt at the start of a line shows
line continuation.
9. R uses NA to represent Not Available, or missing values.
10. To calculate sum excluding NA, use na.rm = TRUE (By default, it is FALSE).
11. The form 1:10 generates the integers from 1 to 10.
12. R is case-sensitive, so you have to use the exact case that the program
requires.
13. To get help for a certain function such as sum, use the form: help(sum)
14. Object names in R can be any length consisting of letters, numbers,
underscores _ or the period .
15. Object names in R should begin with a letter.
16. Unlike SAS and SPSS, R has several different data structures including vectors,
factors, data frames, matrices, arrays, and lists. The data frame is most like a dataset in
SAS.
17. Editing functions in R

Use fix() function
If you call fix() with the name of an existing function, R shows you
the code for that function in a NotePad window, where you can edit it.
For example : fix(x)
When you leave NotePad, say "Yes" to the question "Do you want to save changes?"
(unless you want to discard your changes). Don't "Save As...", just "Save"; R will update
your function automatically.
18. Retrieve your previous command

You can retrieve it with the UP arrow key and edit it to run again.
19. Install packages (or Add-Ins)

install.packages("sas7bdat")
To use the installed package, add the following line of code
library("sas7bdat")
20. Save data to R

save.image("mywork.RData")
21. To tell R which data set to use

attach(mydata)
If you finish with that dataset and wish to use another, you can detach it with:
detach( mydata )
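Several of the basics listed above (c(), NA, na.rm and the 1:10 form) can be sketched in a few lines. This is only an illustration; the vector name scores and its values are made up for the example.

```r
# Combine values into a vector with c(); 'scores' is a made-up example
scores <- c(4, 8, NA, 15)

# sum() returns NA when any element is missing ...
sum(scores)

# ... unless missing values are removed first (na.rm is FALSE by default)
total <- sum(scores, na.rm = TRUE)

# The form 1:10 generates the integers from 1 to 10
one_to_ten <- 1:10
length(one_to_ten)

# R is case-sensitive: 'total' and 'Total' would be two different objects
exists("total")
```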
Data Types
Unlike SAS and SPSS, R has several different data types (structures) including vectors,
factors, data frames, matrices, arrays, and lists. The data frame is most like a dataset in
SAS.
1. Vectors
A vector is an object that contains a set of values called its elements.
Numeric vector
x <- c(1,2,3,4,5,6)
The operator <- is equivalent to the "=" sign.
Character vector
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
To calculate the frequency of each value in the State vector, you can use the table function.
To calculate the mean of a vector, you can use the mean function.
If a vector contains an NA (not available) value, the mean
function returns NA.
To calculate the mean of a vector excluding NA values, you can
include the na.rm = TRUE parameter in the mean function.
You can use subscripts to refer to elements of a vector.
Convert a column "x" to numeric

data$x = as.numeric(data$x)
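The vector operations described in this section can be sketched as follows. The State vector is taken from the text; the numeric vector y and its values are assumptions made for illustration.

```r
# Character vector from the text above
State <- c("DL", "MU", "NY", "DL", "NY", "MU")

# Frequency of each value
state_counts <- table(State)

# A hypothetical numeric vector with one missing value
y <- c(10, 20, NA, 40)
mean(y)                           # NA, because of the missing value
y_mean <- mean(y, na.rm = TRUE)   # mean of 10, 20 and 40

# Subscripts: R counts elements from 1
second <- y[2]        # second element
first_two <- y[1:2]   # first two elements
```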
2. Factors
R has a special data structure to store categorical variables. It tells R that a
variable is nominal or ordinal by making it a factor.
Simplest form of the factor function :
Ideal form of the factor function :
The factor function has three parameters:
1. Vector name
2. Values (optional)
3. Value labels (optional)
Convert a column "x" to factor
data$x = as.factor(data$x)
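A minimal sketch of the two forms of the factor function mentioned above. The vector gender_code and its 1/2 coding are assumptions for the example, not data from the tutorial.

```r
# Hypothetical survey variable coded 1/2
gender_code <- c(1, 2, 2, 1, 2)

# Simplest form: only the vector is supplied; levels are inferred
gender_simple <- factor(gender_code)

# Fuller form: vector, values (levels) and value labels
gender <- factor(gender_code, levels = c(1, 2), labels = c("male", "female"))

levels(gender)   # the value labels
table(gender)    # counts per label
```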
3. Matrices
All values in columns in a matrix must have the same mode (numeric, character, etc.)
and the same length.
The cbind function joins columns together into a matrix. See the usage below
The numbers to the left side in brackets are the row numbers. The form [1, ] means that
it is row number one and the blank following the comma means that R has displayed all
the columns.
To see dimension of the matrix, you can use dim function.
To see correlation of the matrix, you can use cor function.
You can use subscripts to identify rows or columns.
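The matrix functions named above (cbind, dim, cor and subscripts) can be sketched like this. The height and weight vectors are invented for the example.

```r
# Two hypothetical numeric columns of equal length
height <- c(150, 160, 170, 180)
weight <- c(50, 58, 69, 80)

# cbind() joins the columns into a 4 x 2 matrix
m <- cbind(height, weight)

dim(m)          # number of rows, then number of columns
cor(m)          # correlation matrix of the columns

m[1, ]          # first row, all columns
m[, "weight"]   # all rows of the 'weight' column
m[2, 1]         # single cell: row 2, column 1
```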
4. Arrays
Arrays are similar to matrices but can have more than two dimensions.
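As a small illustration (the dimensions are chosen arbitrarily), a three-dimensional array can be built with the array function and indexed with one subscript per dimension:

```r
# A 2 x 3 x 4 array filled with the integers 1 to 24
a <- array(1:24, dim = c(2, 3, 4))

dim(a)                 # the three dimensions
cell <- a[1, 2, 3]     # row 1, column 2 of the third 2 x 3 slice
first_slice <- a[, , 1] # the first slice is an ordinary 2 x 3 matrix
```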
5. Data Frames
A data frame is similar to a SAS or SPSS dataset. It contains variables and records.
It is more general than a matrix, in that different columns can have different modes
(numeric, character, factor, etc.).
The data.frame function is used to combine variables (vectors and factors)
into a data frame.
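A minimal sketch of data.frame combining vectors and a factor. The variable names and values below are made up for illustration.

```r
# Hypothetical variables of different modes
id <- c(1, 2, 3)
name <- c("Ann", "Bob", "Cy")
group <- factor(c("a", "b", "a"))

# Combine them into one data frame; each vector becomes a column
people <- data.frame(id, name, group)

str(people)             # structure: one line per column
sapply(people, class)   # class of each column
```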
6. Lists
A list allows you to store a variety of objects.
You can use subscripts to select the specific component of the list.
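A short sketch of a list holding objects of different types and the subscript forms for selecting components. The list name settings and its contents are assumptions for the example.

```r
# A list can mix a string, a numeric vector and a logical value
settings <- list(title = "Survey", scores = c(3, 5, 4), done = TRUE)

settings[["scores"]]   # the component itself (a numeric vector)
settings$title         # the same kind of access, by name
settings[1]            # single brackets return a list of length one
```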
How to know data type of a column

1. 'class' is a property assigned to an object that determines how generic functions
operate with it. It is not a mutually exclusive classification.
2. 'mode' is a mutually exclusive classification of objects according to their basic
structure. The 'atomic' modes are numeric, complex, character and logical.
> x <- 1:16
> x <- factor(x)
> class(x)
[1] "factor"
> mode(x)
[1] "numeric"
R : CREATE DUMMY DATA

This tutorial explains how to create dummy data.
Method 1 : Enter Data Manually

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
w = c('a', 'b', 'c', 'd', 'e'),
x = c(1, 1, 0, 0, 1))
Method 2 : Sequence of numbers, letters, months and random numbers
1. seq(1, 16, by=2) - sequence of numbers from 1 to 16, incremented by 2.
2. LETTERS[1:8] - the first 8 upper-case letters of the English alphabet.
3. month.abb[1:8] - the three-letter abbreviations for the first 8 English months.
4. sample(10:20, 8, replace = TRUE) - 8 random numbers drawn with replacement from 10 to 20.
5. letters[1:8] - the first 8 lower-case letters of the English alphabet.
df2 <- data.frame(a = seq(1,16,by=2), b = LETTERS[1:8], x=
month.abb[1:8], y = sample(10:20,8, replace = TRUE), z=letters[1:8])
Method 3 : Create numeric grouping variable

df3 = data.frame(X = sample(1:3, 15, replace = TRUE))
It returns 15 random values with replacement from 1 to 3.
Method 4 : Random Numbers with mean 0 and std. dev 1
set.seed(1)
df4 <- data.frame(Y = rnorm(15), Z = ceiling(rnorm(15)))
Method 5 : Create binary variable (0/1)

set.seed(1)
ifelse(sign(rnorm(15))==-1,0,1)
In the code above, if sign of a random number is negative, it returns 0.
Otherwise, 1.
Method 6: Copy Data from Excel to R

Method 7: Create character grouping variable
mydata = sample(LETTERS[1:5],16,replace = TRUE)
It returns 16 random letters drawn with replacement from "A" to "E".
IMPORTING DATA INTO R

Deepanshu Bhalla
This tutorial explains how to get external data into R. It describes how to load data
from various sources such as CSV, text, Excel, SAS or SPSS files.
Loading data into the tool is one of the initial steps of any project. If you have just
started using R, you will soon need to read in data from other sources.
1. Reading a comma-delimited text file (CSV)
If you don't have the names of the variables in the first row
mydata <- read.csv("c:/mydata.csv", header=FALSE)
Note : R uses forward slash instead of backward slash in filename

Important Note : BIG CSV Files should be imported with fread
function of data.table.
library(data.table)
mydata = fread("c:/mydata.csv")
If you have the header row in the first row

mydata <- read.csv("c:/mydata.csv", header=TRUE)
If you want to set any value to a missing value
mydata <- read.csv("c:/mydata.csv", header=TRUE, na.strings=".")
In this case, we have set "." (without quotes) to a missing value
If you want to set multiple values to missing values
mydata <- read.csv("c:/mydata.csv", header=TRUE, na.strings= c("A" , "B" ))
In this case, we have set "A" and "B" (without quotes) to missing values
2. Reading a tab-delimited text file
If you don't have the names (headers) in the first row
mydata <- read.table("c:/mydata.txt")
Note : R uses forward slash instead of backward slash in filename

If you have the names (headers) in the first row
mydata <- read.table("c:/mydata.txt", header=TRUE)
If you want to set any value to a missing value
mydata <- read.table("c:/mydata.txt", header=TRUE, na.strings=".")
In this case, we have set "." (without quotes) to a missing value
If you want to set multiple values to missing values
mydata <- read.table("c:/mydata.txt", header=TRUE, na.strings= c("A" , "B" ))
In this case, we have set "A" and "B" (without quotes) to missing values
3. Reading Excel File

The simplest way to read an Excel file is to save it in CSV format and import it using
the CSV method
mydata <- read.csv("c:/mydata.csv", header=TRUE)
Alternatively, read the Excel file directly with the readxl package.
Step 1 : Install the package once

install.packages("readxl")
Step 2 : Define path and sheet name in the code below
library(readxl)
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")
# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")
4. Reading SAS File
Step 1 : Install the package once
install.packages("haven")
Step 2 : Define path in the code below
library("haven")
read_sas("c:/mydata.sas7bdat")
5. Reading SPSS File

install.packages("haven")

library("haven")
read_spss("c:/mydata.sav")
6. Load Data from R

load("mydata.RData")
EXPORTING DATA IN R
1. Writing comma-delimited text file (CSV)
write.csv(mydata,"C:/Users/Deepanshu/Desktop/test.csv")
2. Writing tab-delimited text file

write.table(mydata, "C:/Users/Deepanshu/Desktop/test.txt", sep="\t")
3. Writing Excel File
Step 1 : Install the package once
install.packages("xlsReadWrite")
Step 2 : Define path and sheet name in the code below
library(xlsReadWrite)
write.xls(mydata, "c:/mydata.xls")
4. Writing SAS File

install.packages("foreign")

library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
5. Writing SPSS File

install.packages("foreign")

library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
COPY DATA FROM EXCEL TO R
This tutorial explains how to create data in MS Excel and paste it into the R syntax
editor window to create a table in R. MS Excel is one of the most frequently used
tools in the analytics industry. Many companies have switched their core analytics work
from Excel to R / SAS, but some data still lives in Excel files.
If you prefer to import an Excel file into R rather than copying and pasting Excel data into R,
see the "Importing Data into R" tutorial.
Step 1 : Prepare Data in Excel

Step 2 : Paste the data inside the text=" " argument of read.table
data = read.table(text="
X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3", header=TRUE)
It creates 3 columns - X, Y and Z. The header = TRUE tells R to treat the
first row as the header.
READING AND SAVING DATA FILE IN R SESSION

Suppose you want to save an individual object in R and read it later.
Saving data file in R session

saveRDS(mydata, "logistic.rds")
Reading stored data from R session

mydata = readRDS("logistic.rds")
Note : You can define any name other than mydata.
Another way : Saving data file in R session

save (mydata,file="E:\\logistic.rdata")
Loading stored data from R session

load("E:\\logistic.rdata", ex <- new.env())
ls(ex)
Saving multiple objects in R session

save(mydata, data2, file="1.RData")
Saving everything in R session

save.image(file="1.RData")
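The save and load calls above can be tried end to end with temporary files. This is a sketch under assumptions: the small data frame and the use of tempfile() for the paths are illustrative, not part of the original tutorial.

```r
# Hypothetical object to store
mydata <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))

# saveRDS()/readRDS() store one object; the name is chosen when reading back
rds_path <- tempfile(fileext = ".rds")
saveRDS(mydata, rds_path)
restored <- readRDS(rds_path)

# save()/load() store objects together with their names;
# loading into a new environment avoids overwriting existing objects
rdata_path <- tempfile(fileext = ".RData")
save(mydata, file = rdata_path)
ex <- new.env()
loaded_names <- load(rdata_path, envir = ex)
ls(ex)
```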
DATA EXPLORATION WITH R

This article demonstrates how to explore data with R. It is very important to explore
data before starting to build a predictive model. It gives an idea about the structure
of the dataset like number of continuous or categorical variables and number of
observations (rows).
Dataset
The dataset used in this tutorial has five
variables - Q1, Q2, Q3, Q4 and Age. The variables Q1-Q4 represent survey
responses to a questionnaire. Each response lies between 1 and 6. The variable Age
represents the age group of the respondent. It lies between 1 and 3: 1 represents
Generation Z, 2 represents Generation X and Y, 3 represents Baby Boomers.
Import data into R

The read.csv() function is used to import CSV file into R. The header = TRUE tells R
that header is included in the data that we are going to import.

mydata <- read.csv("C:/Users/Deepanshu/Documents/Book1.csv", header=TRUE)
1. Calculate basic descriptive statistics
summary(mydata)
To calculate summary of a particular column, say third column, you can use the
following syntax :
summary( mydata[3])
To calculate summary of a particular column by its name, you can use the following
syntax :
summary( mydata$Q1)
2. List names of variables in a dataset
names(mydata)
3. Calculate number of rows in a dataset
nrow(mydata)
4. Calculate number of columns in a dataset
ncol(mydata)
5. List structure of a dataset

str(mydata)
6. See first 6 rows of dataset

head(mydata)
7. First n rows of dataset
In the code below, we are selecting first 5 rows of dataset.

head(mydata, n=5)
8. All rows but the last row

head(mydata, n= -1)
9. Last 6 rows of dataset

tail(mydata)
10. Last n rows of dataset

In the code below, we are selecting last 5 rows of dataset.
tail(mydata, n=5)
11. All rows but the first row

tail(mydata, n= -1)
12. Number of missing values

The function below returns number of missing values in each variable of a dataset.
colSums(is.na(mydata))
13. Number of missing values in a single variable
sum(is.na(mydata$Q1))
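Because the CSV file used above is not included here, the exploration functions can be tried on a small stand-in. The data frame below is invented for illustration and only mimics the Q1, Q2 and Age columns described earlier.

```r
# Hypothetical survey data with a few missing values
mydata <- data.frame(
  Q1  = c(1, 4, NA, 6, 2, 5, 3),
  Q2  = c(2, 2, 5, NA, 6, 1, 4),
  Age = c(1, 3, 2, 2, 1, 3, NA)
)

str(mydata)                            # structure of the dataset
first_two <- head(mydata, n = 2)       # first 2 rows
all_but_last <- head(mydata, n = -1)   # all rows but the last
all_but_first <- tail(mydata, n = -1)  # all rows but the first

# Missing values per variable and in a single variable
missing_per_column <- colSums(is.na(mydata))
missing_in_q1 <- sum(is.na(mydata$Q1))
```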
DATA MANIPULATION WITH R

This tutorial covers how to execute most frequently used data manipulation tasks
with R. It includes various examples with datasets and code. This tutorial is
designed for beginners who are very new to R programming language. It gives you
a quick look at several functions used in R.
1. Replacing / Recoding values

Recoding means replacing existing value(s) with new value(s).
Create Dummy Data

mydata = data.frame(State = ifelse(sign(rnorm(25))==-1,'Delhi','Goa'), Q1=
sample(1:25))
In this example, we are replacing 1 with 6 in Q1 variable
mydata$Q1[mydata$Q1==1] <- 6
In this example, we are replacing "Delhi" with "Mumbai" in State variable. If State is
stored as a factor (the default for strings in data frames before R 4.0), we first need
to convert the variable from factor to character.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
In this example, we are replacing 2 and 3 with NA values in whole dataset.
mydata[mydata == 2 | mydata == 3] <- NA
Another method
You have to first install the car package.
# Install the car package
install.packages("car")
# Load the car package
library("car")
# Recode 1 to 6
mydata$Q1 <- recode(mydata$Q1, "1=6")
Recoding a given range

# Recoding 1 through 4 to 0 and 5 and 6 to 1
mydata$Q1 <- recode(mydata$Q1, "1:4=0; 5:6=1")
You don't need to specify lowest and highest value of a range.

The lo keyword tells recode to start the range at the lowest value.
The hi keyword tells recode to end the range at the highest value.
# Recoding lowest value through 4 to 0 and 5 to highest value to 1
mydata$Q1 <- recode(mydata$Q1, "lo:4=0; 5:hi=1")
You can specify an else condition in the recode statement. It defines how to
treat remaining values that were not already recoded.
# Recoding lowest value through 4 to 0, 5 and 6 to 1, remaining values to 3
mydata$Q1 <- recode(mydata$Q1, "lo:4=0; 5:6=1;else = 3")
2. Recoding to a new column

# Create a new column called Ques1
mydata$Ques1<- recode(mydata$Q1, "1:4=0; 5:6=1")
Note : Make sure you have installed and loaded "car" package before running the
above syntax.
How to use IF ELSE Statement

Sample Data
samples = data.frame(x =c(rep(1:10)), y=letters[1:10])
If a value of variable x is greater than 6, create a new variable called t1
and write 2 against the corresponding values else make it 1.
samples$t1 = ifelse(samples$x>6,2,1)
How to use AND Condition
samples$t3 = ifelse(samples$x>1 & samples$y=="b" ,2,1)
How to use NESTED IF ELSE Statement
samples$t4 = ifelse(samples$x>=1 & samples$x<=4,1,ifelse(samples$x>=5 &
samples$x<=7,2,3))
3. Renaming variables
To rename variables, you have to first install the dplyr package.
# Install the dplyr package
install.packages("dplyr")
# Load the dplyr package

library(dplyr)
# Rename Q1 variable to var1

mydata <- rename(mydata, var1 = Q1)
4. Keeping and Dropping Variables
In this example, we keep only first two variables .

mydata1 <- mydata[1:2]
In this example, we keep first and third through sixth variables .
mydata1 <- mydata[c(1,3:6)]
In this example, we select variables using their names such as v1, v2, v3.
newdata <- mydata[c("v1", "v2", "v3")]
Deleting a particular column (Fifth column)

mydata [-5]
Dropping Q3 variable
mydata$Q3 <- NULL
Deleting multiple columns

mydata [-(3:4) ]
Dropping multiple variables by their names

df = subset(mydata, select = -c(x,z) )
5. Subset data (Selecting Observations)

Subsetting data means filtering rows (observations).
Create Sample Data
mydata = data.frame(Name = ifelse(sign(rnorm(25))==-1,'ABC','DEF'), age =
sample(1:25))
Selecting first 10 observations
newdata <- mydata[1:10,]
Selecting values wherein age is equal to 3
mydata<-subset(mydata, age==3)
Copy data into a new data frame called 'newdata'
newdata<-subset(mydata, age==3)
Conditional Statement (AND) while selecting observations
newdata<-subset(mydata, Name=="ABC" & age==3)
Conditional Statement (OR) while selecting observations
newdata<-subset(mydata, Name=="ABC" | age==3)
Greater than or less than expression
newdata<-subset(mydata, age>=3)
Keeping only missing records
newdata<-subset(mydata, is.na(age))
Keeping only non-missing records
newdata<-subset(mydata, !is.na(age))
6. Sorting
Sorting is one of the most common data manipulation tasks. It is generally used
when we want to see the top 5 highest / lowest values of a variable.
Sorting a vector
x= sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It cannot be used to
sort objects with more than one dimension, such as data frames.
Sorting a data frame

mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT=
sample(1:25))
Sort gender variable in ascending order
mydata.sorted <- mydata[order(mydata$Gender),]
Sort gender variable in ascending order and then SAT in descending order
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]
Note : "-" sign before mydata$SAT tells R to sort SAT variable in descending order.
7. Value labeling
Use factor() for nominal data (this example assumes Gender is coded as 1 and 2)
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male",
"female"))
Use ordered() for ordinal data
mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree",
"Somewhat agree", "Somewhat disagree", "Strongly disagree"))
8. Dealing with missing data
Number of missing values in a variable

colSums(is.na(mydata))
Number of missing values in a row

rowSums(is.na(mydata))
List rows of data that have missing values

mydata[!complete.cases(mydata),]
Creating a new dataset without missing data

mydata1 <- na.omit(mydata)
Convert a value to missing

mydata[mydata$Q1==999,"Q1"] <- NA
9. Aggregate by groups
The following code calculates mean for variable "x" by grouped variable "y".
samples = data.frame(x =c(rep(1:10)), y=round((rnorm(10))))
mydata <- aggregate(x~y, samples, mean, na.rm = TRUE)
10. Frequency for a vector
To calculate frequency for State vector, you can use table() function.
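A minimal sketch of table() on a character vector. The State values are the ones used in the Data Types section; prop.table() is added here as an optional extra for proportions.

```r
State <- c("DL", "MU", "NY", "DL", "NY", "MU")

# Count how often each value occurs
freq <- table(State)

# Optional: convert the counts to proportions
share <- prop.table(freq)

# The counts can also be turned into a data frame
freq_df <- as.data.frame(freq)
```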
11. Merging (Matching)

By default, merge() keeps only the cases common to both datasets.
mydata <- merge(mydata1, mydata2, by=c("ID"))
Detailed Tutorial : Joining and Merging
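A minimal sketch of merge(), assuming two small made-up data frames (mydata1 and mydata2 here are illustrative, not the author's data). The all.x and all arguments of merge() keep unmatched cases as well.

```r
# Two hypothetical datasets sharing the key column ID
mydata1 <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
mydata2 <- data.frame(ID = c(2, 3, 4), y = c(10, 20, 30))

# Default: only IDs present in both datasets (2 and 3)
inner <- merge(mydata1, mydata2, by = "ID")

# Keep every row of mydata1; y is NA where there is no match
left <- merge(mydata1, mydata2, by = "ID", all.x = TRUE)

# Keep every row of both datasets
full <- merge(mydata1, mydata2, by = "ID", all = TRUE)
```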
12. Removing Duplicates

X Y Z
6 5 0
6 5 0
6 1 5
8 5 3
1 NA 1
8 7 2
In the example below, we are removing duplicates in a whole data set. [Equivalent
to NODUP in SAS]
mydata1 <- unique(data)

In the example below, we are removing duplicates by "Y" column. [Equivalent
to NODUPKEY in SAS]
mydata2 <- subset(data, !duplicated(data[,"Y"]))
13. Combining Columns and Rows

If the columns of two matrices have the same number of rows, they can be
combined into a larger matrix using cbind function. In the example below, A and B
are matrices.
newdata<- cbind(A, B)
Similarly, we can combine the rows of two matrices if they have the same number
of columns with the rbind function. In the example below, A and B are matrices.
newdata<- rbind(A, B)
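The two calls above can be tried with small matrices. A and B below are made-up 2 x 3 matrices chosen so that both cbind and rbind apply.

```r
# Two hypothetical matrices with the same shape (2 rows, 3 columns)
A <- matrix(1:6, nrow = 2)
B <- matrix(7:12, nrow = 2)

# Same number of rows: place the columns side by side
wide <- cbind(A, B)   # 2 x 6

# Same number of columns: stack the rows
tall <- rbind(A, B)   # 4 x 3
```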
14. Combining Rows when different set of columns
The function rbind() does not work when the column names do not match in the two
datasets. For example, dataframe1 has 3 columns A, B and C, and dataframe2 also has 3
columns A, D and E. Here rbind() throws an error. The function smartbind() from the
gtools package combines column A and returns NAs where column names do not
match.
install.packages("gtools") #If not installed
library(gtools)
mydata <- smartbind(mydata1, mydata2)
Next Step :
Learn Data Manipulation with dplyr Package
DPLYR TUTORIAL (WITH 50 EXAMPLES)

This is a complete tutorial on data wrangling (manipulation) with R. It
covers one of the most powerful R packages for data wrangling, dplyr. The
package was written by Hadley Wickham, a well-known R programmer who has
written many useful R packages such as ggplot2 and tidyr. It is one of the most
popular R packages to date. This post includes several examples and tips on how
to use the dplyr package for cleaning and transforming data.
What is dplyr?
dplyr is a powerful R package to manipulate, clean and summarize unstructured
data. In short, it makes data exploration and data manipulation easy and fast in R.
What's special about dplyr?

The package "dplyr" comprises many functions that perform the most commonly used data
manipulation operations such as applying filters, selecting specific columns, sorting
data, adding or deleting columns and aggregating data. Another important
advantage of this package is that its functions are easy to learn, use
and recall. For example, filter() is used to filter rows.
dplyr vs. Base R Functions

dplyr functions are generally faster than their base R equivalents because they
were written in a computationally efficient manner. Their syntax is also more consistent, and they
support data frames better than vectors.
SQL Queries vs. dplyr

People have been using SQL to analyze data for decades. Most modern data
analysis tools such as Python, R and SAS support SQL commands. But SQL was
never designed to perform data analysis. It was rather designed for querying and
managing data. There are many data analysis operations where SQL fails or makes
simple things difficult, for example calculating the median for multiple variables or
converting wide format data to long format. The dplyr package, by contrast, was
designed for data analysis.
The names of dplyr functions are similar to SQL commands, such as select() for
selecting variables, group_by() for grouping data by a grouping variable, and join functions such as
inner_join() and left_join() for joining two data sets. It also supports the kind of
sub-queries for which SQL is popular.
How to install and load dplyr package
To install the dplyr package, type the following command.
install.packages("dplyr")
To load dplyr package, type the command below
library(dplyr)
Important dplyr Functions to remember
dplyr Function    Description                      Equivalent SQL
select()          Selecting columns (variables)    SELECT
filter()          Filter (subset) rows             WHERE
group_by()        Group the data                   GROUP BY
summarise()       Summarise (or aggregate) data    -
arrange()         Sort the data                    ORDER BY
join()            Joining data frames (tables)     JOIN
mutate()          Creating new variables           COLUMN ALIAS
Data : Income Data by States

In this tutorial, we are using the following data which contains income generated by
states from year 2002 to 2015. Note : This data does not contain actual income
figures of the states.
This dataset contains 51 observations (rows) and 16 variables (columns).
How to load Data
Submit the following code. Change the file path in the code below.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")
Example 1 : Selecting Random N Rows
The sample_n function selects random rows from a data frame (or table). The
second parameter of the function tells R the number of rows to select.
sample_n(mydata,3)
Example 2 : Selecting Random Fraction of Rows
The sample_frac function returns a random N% of rows. In the example below, it
returns a random 10% of rows.
sample_frac(mydata,0.1)
Example 3 : Remove Duplicate Rows based on all the variables (Complete Row)
The distinct function is used to eliminate duplicates.
x1 = distinct(mydata)
Example 4 : Remove Duplicate Rows based on a variable
The .keep_all argument is used to retain all other variables in the output data frame.
x2 = distinct(mydata, Index, .keep_all= TRUE)
Example 5 : Remove Duplicate Rows based on multiple variables
In the example below, we are using two variables - Index, Y2010 to determine
uniqueness.
x2 = distinct(mydata, Index, Y2010, .keep_all= TRUE)
select( ) Function
It is used to select only desired variables.
select() syntax : select(data , ....)
data : Data Frame
.... : Variables by name or by function
Example 6 : Selecting Variables (or Columns)
Suppose you are asked to select only a few variables. The code below selects
variables "Index", columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)
Example 7 : Dropping Variables

The minus sign before a variable tells R to drop the variable.
mydata = select(mydata, -Index, -State)
The above code can also be written like :
mydata = select(mydata, -c(Index,State))
Example 8 : Selecting or Dropping Variables that start with 'Y'
The starts_with() function is used to select variables whose names start with a given prefix.
mydata3 = select(mydata, starts_with("Y"))
Adding a negative sign before starts_with() drops the variables whose names start
with 'Y'
mydata33 = select(mydata, -starts_with("Y"))
The following functions help you to select variables based on their names.
Helpers          Description
starts_with()    Starts with a prefix
ends_with()      Ends with a suffix
contains()       Contains a literal string
matches()        Matches a regular expression
num_range()      Numerical range like x01, x02, x03
one_of()         Variables in character vector
everything()     All variables
Example 9 : Selecting Variables that contain 'I' in their names
mydata4 = select(mydata, contains("I"))
Example 10 : Reorder Variables

The code below keeps variable 'State' in the front and the remaining variables
follow that.
mydata5 = select(mydata, State, everything())
rename( ) Function
It is used to change variable name.
rename() syntax : rename(data , new_name = old_name)
data : Data Frame
new_name : New variable name you want to keep
old_name : Existing Variable Name
Example 11 : Rename Variables

The rename function can be used to rename variables.
In the following code, we are renaming 'Index' variable to 'Index1'.
mydata6 = rename(mydata, Index1=Index)
filter( ) Function
It is used to subset data with matching logical conditions.
filter() syntax : filter(data , ....)
data : Data Frame
.... : Logical Condition
Example 12 : Filter Rows

Suppose you need to subset data. You want to filter rows and retain only those
rows in which Index is equal to A.
mydata7 = filter(mydata, Index == "A")
Example 13 : Multiple Selection Criteria
The %in% operator can be used to select multiple items. In the following program,
we are telling R to select rows with 'A' or 'C' in column 'Index'.
mydata7 = filter(mydata6, Index %in% c("A", "C"))
Example 14 : 'AND' Condition in Selection Criteria
Suppose you need to apply an 'AND' condition. In this case, we are picking data for 'A'
and 'C' in the column 'Index' and income of at least 1.3 million in Year 2002.
mydata8 = filter(mydata6, Index %in% c("A", "C") & Y2002 >= 1300000 )
Example 15 : 'OR' Condition in Selection Criteria
The '|' operator denotes OR in the logical condition. It means either of the two conditions.
mydata9 = filter(mydata6, Index %in% c("A", "C") | Y2002 >= 1300000)
Example 16 : NOT Condition

The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata6, !Index %in% c("A", "C"))
Example 17 : CONTAINS Condition

The grepl function is used for pattern matching. In the following code,
we are looking for records in which the column State contains 'Ar'.
mydata10 = filter(mydata6, grepl("Ar", State))
summarise( ) Function
It is used to summarize data.
summarise() syntax : summarise(data , ....)
data : Data Frame
..... : Summary Functions such as mean, median etc
Example 18 : Summarize selected variables
In the example below, we are calculating mean and median for the variable Y2015.
summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))
Example 19 : Summarize Multiple Variables
In the following example, we are calculating number of records, mean and median
for variables Y2005 and Y2006. The summarise_at function allows us to select
multiple variables by their names.
summarise_at(mydata, vars(Y2005, Y2006), funs(n(), mean, median))
Example 20 : Summarize with Custom Functions
We can also use custom functions in the summarise function. In this case, we are
computing the number of records, number of missing values, mean and median for
variables Y2011 and Y2012. The dot (.) denotes each variable specified in the
second argument of the function.
summarise_at(mydata, vars(Y2011, Y2012),
funs(n(), missing = sum(is.na(.)), mean(., na.rm = TRUE), median(.,na.rm =
TRUE)))
Example 21 : Summarize all Numeric Variables
First, store data for all the numeric variables
numdata = mydata[sapply(mydata,is.numeric)]
Second, the summarise_all function calculates summary statistics for all the
columns in a data frame
summarise_all(numdata, funs(n(),mean,median))
Example 22 : Summarize Factor Variable
We are checking the number of levels/categories and count of missing
observations in a categorical (factor) variable.
summarise_all(mydata["Index"], funs(nlevels(.), sum(is.na(.))))
arrange() function :
Use : Sort data
Syntax
arrange(data_frame, variable(s)_to_sort)
or
data_frame %>% arrange(variable(s)_to_sort)
To sort a variable in descending order, use desc(x).
Example 23 : Sort Data by Multiple Variables
The default sorting order of arrange() function is ascending. In this example, we
are sorting data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and another variable in
ascending order.
arrange(mydata, desc(Index), Y2011)
Pipe Operator %>%

It is important to understand the pipe (%>%) operator before learning the other
functions of the dplyr package. dplyr uses the pipe operator from another
package (magrittr).
It lets you chain operations, much like writing sub-queries in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator.
So why use the pipe operator %>%? Because it
lets you chain multiple functions together in a readable, left-to-right order.
Syntax :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to filter function. It can be used with any
function.
Example :
The code below demonstrates the usage of the pipe %>% operator. In this example, we
are selecting 10 random observations of two variables, "Index" and "State", from the data
frame "mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)
Output
group_by() function :
Use : Group data by categorical variable
Syntax :
group_by(data, variables)
or
data %>% group_by(variables)
Example 24 : Summarise Data by Categorical Variable
We are calculating the count and mean of variables Y2011 and Y2012 by variable Index.
t = summarise_at(group_by(mydata, Index), vars(Y2011, Y2012),
funs(n(), mean(., na.rm = TRUE)))
The above code can also be written as
t = mydata %>% group_by(Index) %>%
summarise_at(vars(Y2011, Y2012), funs(n(), mean(., na.rm = TRUE)))
do() function :
Use : Compute within groups

Syntax :
do(data_frame, expressions_to_apply_to_each_group)
Note : The dot (.) is required to refer to a data frame.
Example 25 : Filter Data within a Categorical Variable
Suppose you need to pull top 2 rows from 'A', 'C' and 'I' categories of variable
Index.
t = mydata %>% filter(Index %in% c("A", "C","I")) %>% group_by(Index) %>%
do(head( . , 2))
Output : do() function
Example 26 : Selecting 3rd Maximum Value by Categorical Variable
We are calculating the third maximum value of variable Y2015 by variable Index. The
following code first selects only two variables, Index and Y2015. Then it keeps the
rows where Index is 'A', 'C' or 'I', groups by Index and sorts
Y2015 in descending order within each group. Finally, it selects the third row of each group.
t = mydata %>% select(Index, Y2015) %>%
filter(Index %in% c("A", "C","I")) %>%
group_by(Index) %>%
do(arrange(.,desc(Y2015))) %>% slice(3)
The slice() function is used to select rows by position.
Output
Using Window Functions

Like SQL, dplyr provides window functions, which can be used to subset data within a group. A window
function returns a vector of values, one per input row. In the preceding example, we could instead
use the min_rank() function, which calculates ranks:
t = mydata %>% select(Index, Y2015) %>%
filter(Index %in% c("A", "C","I")) %>%
group_by(Index) %>%
filter(min_rank(desc(Y2015)) == 3)
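The two approaches can differ when a group contains ties, because min_rank() gives tied values the same rank and then skips the following rank. A small sketch on made-up data (the data frame toy is an assumption for illustration, not mydata):

```r
library(dplyr)

# Hypothetical data: group "A" has a tie for second place.
toy <- data.frame(Index = c("A", "A", "A", "A", "C", "C", "C"),
                  Y2015 = c(50, 40, 40, 30, 9, 8, 7))

ranked <- toy %>%
  group_by(Index) %>%
  mutate(rank = min_rank(desc(Y2015))) %>%
  ungroup()

# Group "A" gets ranks 1, 2, 2, 4, so it has no row with rank 3.
third <- filter(ranked, rank == 3)
third
```

If every group must return exactly one row, sorting and using slice(3) as in Example 26 is the safer choice; dense_rank() is another option when tied values should not leave gaps in the ranking.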
Example 27 : Summarize, Group and Sort Together
In this case, we are computing the mean of variables Y2014 and Y2015 by variable
Index, then sorting the result by the calculated mean of Y2015 in descending order.
t = mydata %>%
group_by(Index)%>%
summarise(Mean_2014 = mean(Y2014, na.rm=TRUE),
Mean_2015 = mean(Y2015, na.rm=TRUE)) %>%
arrange(desc(Mean_2015))
mutate() function :
Use : Creates new variables

Syntax :
mutate(data_frame, expression(s) )
or
data_frame %>% mutate(expression(s))
Example 28 : Create a new variable
The following code divides Y2015 by Y2014 and names the result "change".
mydata1 = mutate(mydata, change=Y2015/Y2014)
Example 29 : Multiply all the variables by 1000
It creates new variables and names them with the suffix "_new".
mydata11 = mutate_all(mydata, funs("new" = .* 1000))
Output
The output shown in the image above is truncated due to high number of variables.
Example 30 : Calculate Rank for Variables
Suppose you need to calculate rank for variables Y2008 to Y2010.

mydata12 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(.)))
Output
By default, min_rank() assigns 1 to the smallest value and the highest rank to the
largest value. If you need to assign rank 1 to the largest value of a variable,
use min_rank(desc(.))
mydata13 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(desc(.))))
Example 31 : Select State that generated highest income among the variable 'Index'
out = mydata %>% group_by(Index) %>% filter(min_rank(desc(Y2015)) == 1) %>%
select(Index, Y2015)
Example 32 : Cumulative Income of 'Index' variable
The cumsum function calculates the cumulative sum of a variable. With the mutate
function, we insert a new variable called 'Total' which contains the
cumulative income within each level of Index.
out2 = mydata %>% group_by(Index) %>% mutate(Total=cumsum(Y2015)) %>%
select(Index, Y2015, Total)
join() function :
Use : Join two datasets
Syntax :
inner_join(x, y, by = )
left_join(x, y, by = )
right_join(x, y, by = )
full_join(x, y, by = )
semi_join(x, y, by = )
anti_join(x, y, by = )
x, y - datasets (or tables) to merge / join
by - common variable (primary key) to join by.
Example 33 : Common rows in both the tables
Let's create two data frames say df1 and df2.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
w = c('a', 'b', 'c', 'd', 'e'),
x = c(1, 1, 0, 0, 1),
y=rnorm(5),
z=letters[1:5])
df2 <- data.frame(ID = c(1, 7, 3, 6, 8),
a = c('z', 'b', 'k', 'd', 'l'),
b = c(1, 2, 3, 0, 4),
c =rnorm(5),
d =letters[2:6])
INNER JOIN returns rows when there is a match in both tables. In this example, we
are merging df1 and df2 with ID as common variable (primary key).
df3 = inner_join(df1, df2, by = "ID")
Output : INNER JOIN
If the primary key does not have same name in both the tables, try the following
way:
inner_join(df1, df2, by = c("ID"="ID1"))
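The df2 created above has no ID1 column, so the named form of by can be tried on a separate pair of tables. A minimal sketch, assuming two made-up data frames left and right whose key columns are called ID and ID1:

```r
library(dplyr)

# Hypothetical tables: the key is called ID in one and ID1 in the other.
left  <- data.frame(ID  = c(1, 2, 3), w = c("a", "b", "c"))
right <- data.frame(ID1 = c(1, 3, 7), a = c("z", "k", "b"))

# by = c("ID" = "ID1") reads: match left$ID against right$ID1.
joined <- inner_join(left, right, by = c("ID" = "ID1"))
joined
```

The result keeps the key under the name used in the first table (ID) and contains only the rows whose key appears in both tables.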
Example 34 : Applying LEFT JOIN

LEFT JOIN : It returns all rows from the left table, even if there are no matches in
the right table.
left_join(df1, df2, by = "ID")
Output : LEFT JOIN
Combine Data Vertically

intersect(x, y)
Rows that appear in both x and y.
union(x, y)
Rows that appear in either or both x and y.
setdiff(x, y)
Rows that appear in x but not y.
Example 35 : Applying INTERSECT

Prepare Sample Data for Demonstration
mtcars$model <- rownames(mtcars)
first <- mtcars[1:20, ]
second <- mtcars[10:32, ]
INTERSECT selects unique rows that are common to both the data frames.
intersect(first, second)
Example 36 : Applying UNION

UNION displays all rows from both the tables and removes duplicate records from
the combined dataset. The union_all function keeps duplicate rows in the
combined dataset.
x=data.frame(ID = 1:6, ID1= 1:6)
y=data.frame(ID = 1:6, ID1 = 1:6)
union(x,y)
union_all(x,y)
Example 37 : Rows appear in one table but not in other table
setdiff(first, second)
Example 38 : IF ELSE Statement

Syntax :
if_else(condition, true, false, missing = NULL)
true : Value to use where the condition is TRUE
false : Value to use where the condition is FALSE
missing : If not NULL, value to use where the condition is NA
df <- c(-10,2, NA)
if_else(df < 0, "negative", "positive", missing = "missing value")
Create a new variable with IF_ELSE

If a value is less than 5, add 1 to it; if it is greater than or equal to 5, add 2 to it.
If the value is missing, return 0.
df =data.frame(x = c(1,5,6,NA))
df$newvar = if_else(df$x<5, df$x+1, df$x+2,0)
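To see what the missing argument changes, the same rule can be run next to base R's ifelse(), which has no such argument. A short sketch using the data frame from the example above (the name base_version is only for illustration):

```r
library(dplyr)

df <- data.frame(x = c(1, 5, 6, NA))

# dplyr: the fourth argument supplies the value for NA conditions.
df$newvar <- if_else(df$x < 5, df$x + 1, df$x + 2, missing = 0)

# base R: NA conditions stay NA.
base_version <- ifelse(df$x < 5, df$x + 1, df$x + 2)

df$newvar      # 2 7 8 0
base_version   # 2 7 8 NA
```

Note that if_else() also requires the true, false and missing values to be of compatible types, which is why 0 is a number here rather than a string.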
Output
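Example 39 : Apply ROW WISE Operation
Suppose you want to find the maximum value in each row across the variables Y2012, Y2013,
Y2014 and Y2015. The rowwise() function allows you to apply functions to rows. The columns
are listed individually inside max(), because Y2012:Y2015 there would build a sequence
between two values instead of comparing the four columns.
df = mydata %>%
rowwise() %>% mutate(Max= max(Y2012, Y2013, Y2014, Y2015)) %>%
select(Y2012:Y2015,Max)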
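For a row maximum over a handful of numeric columns, base R's pmax() is a vectorised alternative that needs no row-by-row processing. A minimal sketch on made-up data (the data frame toy stands in for the Y2012 to Y2015 columns of mydata):

```r
# Hypothetical stand-in for four income columns; values are invented.
toy <- data.frame(Y2012 = c(5, 9, 2),
                  Y2013 = c(7, 1, 2),
                  Y2014 = c(6, 3, 8),
                  Y2015 = c(4, 2, NA))

# pmax() compares the columns element by element and returns one value per row.
toy$Max <- pmax(toy$Y2012, toy$Y2013, toy$Y2014, toy$Y2015, na.rm = TRUE)
toy$Max   # 7 9 8
```

With na.rm = TRUE a missing value in one column is ignored as long as the row has at least one non-missing value.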
Output
Example 40 : Combine Data Frames
Suppose you are asked to combine two data frames. Let's first create two sample
datasets.
df1=data.frame(ID = 1:6, x=letters[1:6])
df2=data.frame(ID = 7:12, x=letters[7:12])
Input Datasets
The bind_rows() function combines two datasets by rows. So the combined dataset
would contain 12 rows (6+6) and 2 columns.
xy = bind_rows(df1,df2)
It is equivalent to base R function rbind.
xy = rbind(df1,df2)
The bind_cols() function combines two datasets by columns. So the combined
dataset would contain 4 columns and 6 rows.
xy = bind_cols(df1,df2)
or
xy = cbind(df1,df2)
The output is shown below-
cbind Output
Example 41 : Calculate Percentile Values
The quantile() function is used to determine the Nth percentile value. In this example,
we are computing percentile values by variable Index.
mydata %>% group_by(Index) %>%
summarise(Percentile_25=quantile(Y2015, probs=0.25),
Percentile_50=quantile(Y2015, probs=0.5),
Percentile_75=quantile(Y2015, probs=0.75),
Percentile_99=quantile(Y2015, probs=0.99))
The ntile() function is used to divide the data into N bins.
x= data.frame(N= 1:10)
x = mutate(x, pos = ntile(N,5))
Example 42 : Automate Model Building
This example explains the advanced usage of the do() function. In this example, we
are building a linear regression model for each level of a categorical variable. There
are 3 levels in variable cyl of dataset mtcars.
length(unique(mtcars$cyl))
Result : 3
by_cyl <- group_by(mtcars, cyl)
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
summarise(models, rsq = summary(mod)$r.squared)
models %>% do(data.frame(
var = names(coef(.$mod)),
coef(summary(.$mod)))
)
Output : R-Squared Values
Endnotes
There are hundreds of packages that depend on this package. Its main
benefits are that it takes the fear out of R programming, makes coding nearly effortless and
lowers processing time. However, some R programmers prefer the data.table package
for its speed. I would recommend learning both packages. Check out the data.table
tutorial. The data.table package wins over dplyr in terms of speed if the data size is
greater than 1 GB.
Data Wrangling with data.table package

DATA.TABLE TUTORIAL (WITH 50 EXAMPLES)
Deepanshu Bhalla
This tutorial describes how to manipulate data with the data.table R package. It
is considered the fastest R package for data wrangling. Analysts often
call R incompatible with big datasets ( > 10 GB) because it is not
memory efficient and loads everything into RAM. The
'data.table' package was built to change that perception. It was designed to be
concise and painless. Many benchmarks have compared
dplyr and data.table, and data.table wins in each of them. The
efficiency of this package has also been compared with Python's pandas package,
and data.table wins there too. On CRAN, more than 200 packages
depend on data.table, which places it among the top 5 R packages. This
post includes various examples and practice questions to make you familiar
with the package.
data.table Tutorial
data.table Syntax
The syntax of data.table is shown in the image below :
data.table Syntax
DT[ i , j , by]
1. The first parameter of data.table, i, refers to rows. It implies subsetting
rows. It is equivalent to the WHERE clause in SQL.
2. The second parameter of data.table, j, refers to columns. It implies
subsetting columns (dropping / keeping). It is equivalent to the SELECT clause in
SQL.
3. The third parameter of data.table, by, refers to adding a group so that
all calculations are done within a group. It is equivalent to SQL's GROUP
BY clause.
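The three parameters can be seen working together in one call. A minimal sketch on a small made-up table (DT and its values are assumptions for illustration, not the flights data used later):

```r
library(data.table)

# Hypothetical toy table; values are invented.
DT <- data.table(origin  = c("JFK", "JFK", "LGA", "LGA", "EWR"),
                 carrier = c("AA", "DL", "AA", "AA", "DL"),
                 delay   = c(10, 20, 5, 15, 30))

# i : keep rows where carrier is "AA"     (WHERE)
# j : compute the mean delay              (SELECT)
# by: one result row per origin           (GROUP BY)
res <- DT[carrier == "AA", .(mean_delay = mean(delay)), by = origin]
res
```

EWR drops out of the result because none of its rows pass the filter in i.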
The data.table syntax is NOT RESTRICTED to only 3
parameters. There are other arguments that can be added to data.table
syntax. The list is as follows:
1. with, which
2. allow.cartesian
3. roll, rollends
4. .SD, .SDcols
5. on, mult, nomatch
The above arguments are explained in the latter part of the post.
How to Install and load data.table Package
install.packages("data.table")
#load required library
library(data.table)
Read Data
In data.table package, fread() function is available to read or get data from
your computer or from a web page. It is equivalent to read.csv() function of
base R.
mydata = fread("https://github.com/arunsrinivasan/satrdaysworkshop/raw/master/flights_2014.csv")
Describe Data
This dataset contains 253K observations and 17 columns. It constitutes
information about flights' arrival or departure time, delays, flight cancellation
and destination in year 2014.

nrow(mydata)
[1] 253316
ncol(mydata)
[1] 17
names(mydata)
[1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time"  "arr_delay"
[8] "cancelled" "carrier"   "tailnum"   "flight"    "origin"    "dest"      "air_time"
[15] "distance"  "hour"      "min"
head(mydata)
   year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight
1: 2014     1   1      914        14     1238        13         0      AA  N338AA      1
2: 2014     1   1     1157        -3     1523        13         0      AA  N335AA      3
3: 2014     1   1     1902         2     2224         9         0      AA  N327AA     21
4: 2014     1   1      722        -8     1014       -26         0      AA  N3EHAA     29
5: 2014     1   1     1347         2     1706         1         0      AA  N319AA    117
6: 2014     1   1     1824         4     2145         0         0      AA  N3DEAA    119
   origin dest air_time distance hour min
1:    JFK  LAX      359     2475    9  14
2:    JFK  LAX      363     2475   11  57
3:    JFK  LAX      351     2475   19   2
4:    LGA  PBI      157     1035    7  22
5:    JFK  LAX      350     2475   13  47
6:    EWR  LAX      339     2454   18  24
Selecting or Keeping Columns

Suppose you need to select only 'origin' column. You can use the code below
dat1 = mydata[ , origin] # returns a vector
The above line of code returns a vector not data.table.
To get result in data.table format, run the code below :
dat1 = mydata[ , .(origin)] # returns a data.table
It can also be written the data.frame way
dat1 = mydata[, c("origin"), with=FALSE]
Keeping a column based on column position

dat2 =mydata[, 2, with=FALSE]
In this code, we are selecting second column from mydata.
Keeping Multiple Columns

The following code tells R to select 'origin', 'year', 'month', 'hour' columns.
dat3 = mydata[, .(origin, year, month, hour)]
Keeping multiple columns based on column position
You can keep the second through fourth columns using the code below
dat4 = mydata[, c(2:4), with=FALSE]
Dropping a Column
Suppose you want to include all the variables except one column, say
'origin'. It can be easily done by adding the ! sign (which implies negation in R)
dat5 = mydata[, !c("origin"), with=FALSE]
Dropping Multiple Columns

dat6 = mydata[, !c("origin", "year", "month"), with=FALSE]
Keeping variables that contain 'dep'

You can use %like% operator to find pattern. It is same as base R's grepl()
function, SQL's LIKE operator and SAS's CONTAINS function.
dat7 = mydata[,names(mydata) %like% "dep", with=FALSE]
Rename Variables
You can rename variables with the setnames() function. In the following code,
we are renaming the variable 'dest' to 'Destination'.
setnames(mydata, c("dest"), c("Destination"))

To rename multiple variables, list the old names in the first vector and the new names in the second.
setnames(mydata, c("dest","origin"), c("Destination", "origin.of.flight"))
Subsetting Rows / Filtering

Suppose you are asked to find all the flights whose origin is 'JFK'.
# Filter based on one variable
dat8 = mydata[origin == "JFK"]
Select Multiple Values

Filter all the flights whose origin is either 'JFK' or 'LGA'
dat9 = mydata[origin %in% c("JFK", "LGA")]
Apply Logical Operator : NOT

The following program selects all the flights whose origin is neither 'JFK'
nor 'LGA'
# Exclude Values
dat10 = mydata[!origin %in% c("JFK", "LGA")]
Filter based on Multiple variables

Select all the flights whose origin is 'JFK' and whose carrier is 'AA'
dat11 = mydata[origin == "JFK" & carrier == "AA"]
Faster Data Manipulation with Indexing
data.table uses a binary search algorithm that makes data manipulation
faster.
Binary Search Algorithm

Binary search is an efficient algorithm for finding a value in a sorted list of
values. It involves repeatedly splitting in half the portion of the list that
could contain the value, until you find the value that you are searching for.
Suppose you have the following values in a variable :
5, 10, 7, 20, 3, 13, 26
You are searching for the value 20 in the above list. See how the binary search
algorithm works:
1. First, we sort the values: 3, 5, 7, 10, 13, 20, 26.
2. We calculate the middle value, i.e. 10.
3. We check whether 20 = 10. No. 20 > 10.
4. Since 20 is greater than 10, it must be somewhere after 10. So we
can ignore all the values that are lower than or equal to 10.
5. We are left with 13, 20, 26. The middle value is 20.
6. We again check whether 20 = 20. Yes. The match is found.
If we do not use this algorithm, we have to compare 20 against the values one by one,
which in the worst case means going through the whole list of seven values.
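The steps above can be sketched as a small base R function. This is only an illustration of the idea, not how data.table implements it internally; the function name binary_search is made up for this example.

```r
# Returns the position of `target` in the sorted vector `x`, or NA if absent.
binary_search <- function(x, target) {
  lo <- 1L
  hi <- length(x)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (x[mid] == target) {
      return(mid)
    } else if (x[mid] < target) {
      lo <- mid + 1L   # target can only be to the right of mid
    } else {
      hi <- mid - 1L   # target can only be to the left of mid
    }
  }
  NA_integer_
}

values <- sort(c(5, 10, 7, 20, 3, 13, 26))   # 3 5 7 10 13 20 26
pos <- binary_search(values, 20)
pos   # 6
```

The search looks at 10 first, then at 20, so it finds the value after two comparisons instead of scanning the list from the start.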
It is important to set a key in your dataset, which tells the system that the data is
sorted by the key column. For example, suppose you have each employee's name, address,
salary, designation, department and employee ID. We can use 'employee ID' as a
key to search for a particular employee.
Set Key
In this case, we are setting 'origin' as a key in the dataset mydata.
# Indexing (Set Keys)
setkey(mydata, origin)
Note : It makes the data table sorted by the column 'origin'.
How to filter when a key is set
You don't need to refer to the key column when you apply a filter.
data12 = mydata[c("JFK", "LGA")]
Performance Comparison
You can compare performance of the filtering process (With or Without
KEY).
system.time(mydata[origin %in% c("JFK", "LGA")])
system.time(mydata[c("JFK", "LGA")])
Performance - With or without KEY
If you look at the real (elapsed) time in the image above, setting a key makes filtering
about twice as fast as filtering without keys.
Indexing Multiple Columns

We can also set keys to multiple columns like we did below to columns
'origin' and 'dest'. See the example below.
setkey(mydata, origin, dest)
Filtering while setting keys on Multiple Columns
# First key column 'origin' matches JFK and second key column 'dest' matches MIA
mydata[.("JFK", "MIA")]
It is equivalent to the following code :
mydata[origin == "JFK" & dest == "MIA"]
To identify the column(s) indexed by the key
key(mydata)
Result : It returns origin and dest, as these are the columns that are set as keys.
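The keyed filter and the ordinary filter can be compared on a small table to confirm that they select the same rows. A minimal sketch, assuming a made-up table DT (not the flights data):

```r
library(data.table)

# Hypothetical toy table; values are invented.
DT <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 dest   = c("MIA", "MIA", "LAX", "LAX"),
                 flight = c(11, 22, 33, 44))

setkey(DT, origin, dest)   # sorts DT by origin, then dest, and marks them as keys

keyed   <- DT[.("JFK", "MIA")]                   # lookup through the key
scanned <- DT[origin == "JFK" & dest == "MIA"]   # ordinary logical filter

key(DT)   # "origin" "dest"
```

Both calls return the single JFK to MIA row; the difference is only in how the rows are located.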
Sorting Data
We can sort data using the setorder() function. By default, it sorts data in
ascending order.
mydata01 = setorder(mydata, origin)
Sorting Data in descending order
In this case, we are sorting data by the 'origin' variable in descending order.
mydata02 = setorder(mydata, -origin)
Sorting Data based on multiple variables
In this example, we tell R to reorder data first by origin in ascending order
and then by 'carrier' in descending order.
mydata03 = setorder(mydata, origin, -carrier)
Adding Columns (Calculation on rows)
You can add a column computed from other columns with the := operator. In this example,
we are subtracting the 'dep_delay' variable from the 'dep_time' variable to compute
scheduled departure time.
mydata[, dep_sch:=dep_time - dep_delay]
Adding Multiple Columns

mydata002 = mydata[, c("dep_sch","arr_sch"):=list(dep_time - dep_delay,
arr_time - arr_delay)]
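One point worth knowing about := is that it modifies the table in place, so assigning the result to a new name does not create a separate copy. A short sketch on made-up data (DT, DT2 and the other names here are assumptions for illustration):

```r
library(data.table)

DT <- data.table(dep_time = c(914, 1157), dep_delay = c(14, -3))

# := adds the column to DT itself; `same` is just another name for that table.
same <- DT[, dep_sch := dep_time - dep_delay]
"dep_sch" %in% names(DT)    # TRUE: DT was modified in place

# copy() gives an independent table when the original must stay untouched.
DT2 <- data.table(dep_time = c(914, 1157), dep_delay = c(14, -3))
independent <- copy(DT2)[, dep_sch := dep_time - dep_delay]
"dep_sch" %in% names(DT2)   # FALSE
```

This is why mydata itself gains the new columns in the examples above, whichever name the result is assigned to.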
IF THEN ELSE
The 'IF THEN ELSE' conditions are very popular for recoding values. In
data.table package, it can be done with the following methods :
Method I : mydata[, flag:= 1*(min < 50)]
Method II : mydata[, flag:= ifelse(min < 50, 1,0)]
It means to set flag= 1 if min is less than 50. Otherwise, set flag =0.
How to write Sub Queries (like SQL)
We can use the format DT[ ] [ ] [ ] to build a chain in data.table. It works like
sub-queries in SQL.
mydata[, dep_sch:=dep_time - dep_delay][,.(dep_time,dep_delay,dep_sch)]
First, we are computing scheduled departure time and then selecting only
relevant columns.
Summarize or Aggregate Columns

Like SAS PROC MEANS procedure, we can generate summary statistics of
specific variables. In this case, we are calculating mean, median, minimum
and maximum value of variable arr_delay.
mydata[, .(mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE),
min = min(arr_delay, na.rm = TRUE),
max = max(arr_delay, na.rm = TRUE))]
Summarize with data.table package
Summarize Multiple Columns
To summarize multiple variables, we can simply write all the summary
statistics functions in a bracket. See the command below
mydata[, .(mean(arr_delay), mean(dep_delay))]
If you need to calculate summary statistics for a larger list of variables, you
can use .SD and .SDcols operators. The .SD operator implies 'Subset of
Data'.
mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")]
In this case, we are calculating mean of two variables - arr_delay and
dep_delay.
Summarize all numeric Columns
By default, .SD contains all the columns except the grouping variables, so the call below
gives meaningful results only for the numeric columns.
mydata[, lapply(.SD, mean)]
Summarize with multiple statistics

mydata[, sapply(.SD, function(x) c(mean=mean(x), median=median(x)))]
GROUP BY (Within Group Calculation)
Summarize by group 'origin'
mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = origin]
Summary by group
Use key column in a by operation
Instead of 'by', you can use the keyby= argument, which also sorts the result by the
grouping column and sets it as the key.
mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), keyby = origin]
Summarize multiple variables by group 'origin'
mydata[, .(mean(arr_delay, na.rm = TRUE), mean(dep_delay, na.rm = TRUE)), by = origin]
Or it can be written like below
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay",
"dep_delay"), by = origin]
Remove Duplicates
You can remove non-unique / duplicate cases with the unique() function. Suppose
you want to eliminate duplicates based on a variable, say carrier. In data.table 1.9.8 and
later, unique() compares all columns by default, so name the column in the by argument.
unique(mydata, by = "carrier")
Suppose you want to remove duplicates based on all the variables. You can
use the command below
unique(mydata)
Note : In versions before 1.9.8, unique() used the key columns by default, so the key had
to be set to carrier for the first case and to NULL for the second.
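A small made-up table shows the difference between removing duplicates on one column and on all columns. The table DT and its values are assumptions for illustration only.

```r
library(data.table)

# Hypothetical toy table: rows 1 and 2 are identical in every column.
DT <- data.table(carrier = c("AA", "AA", "DL", "DL"),
                 flight  = c(1, 1, 7, 9))

one_per_carrier <- unique(DT, by = "carrier")   # first row of each carrier
no_exact_dups   <- unique(DT)                   # drops rows identical in every column

nrow(one_per_carrier)   # 2
nrow(no_exact_dups)     # 3
```

Passing by explicitly makes the intent clear regardless of whether a key is set on the table.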
Extract values within a group

The following command selects first and second values from a categorical
variable carrier.
mydata[, .SD[1:2], by=carrier]
Select LAST value from a group

mydata[, .SD[.N], by=carrier]
SQL's RANK OVER PARTITION

In SQL, Window functions are very useful for solving complex data problems.
RANK OVER PARTITION is the most popular window function. It can be easily
translated in data.table with the help of frank() function. frank() is similar to
base R's rank() function but much faster. See the code below.
dt = mydata[, rank:=frank(-distance,ties.method = "min"), by=carrier]
In this case, we are calculating rank of variable 'distance' by 'carrier'. We are
assigning rank 1 to the highest value of 'distance' within unique values of
'carrier'.
Cumulative SUM by GROUP

We can calculate cumulative sum by using cumsum() function.
dat = mydata[, cum:=cumsum(distance), by=carrier]
Lag and Lead

The lag and lead of a variable can be calculated with the shift() function. The
syntax of the shift() function is as follows:
shift(variable_name, number_of_lags, type=c("lag", "lead"))
DT <- data.table(A=1:5)
DT[ , X := shift(A, 1, type="lag")]
DT[ , Y := shift(A, 1, type="lead")]
Lag and Lead Function
Between and LIKE Operator

We can use %between% operator to define a range. It is inclusive of the
values of both the ends.
DT = data.table(x=6:10)
DT[x %between% c(7,9)]
The %like% operator is mainly used to find all the values that match a pattern.
DT = data.table(Name=c("dep_time","dep_delay","arrival"), ID=c(2,3,4))
DT[Name %like% "dep"]
Merging / Joins
The merging in data.table is very similar to the base R merge() function. The
only difference is that data.table by default uses the common key variable as the
primary key to merge two datasets, whereas data.frame uses the common
variable names as the primary key to merge the datasets.
Sample Data
(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
Inner Join
It returns all the matching observations in both the datasets.

merge(dt1, dt2, by="A")
Left Join
It returns all observations from the left dataset and the matched
observations from the right dataset.
merge(dt1, dt2, by="A", all.x = TRUE)
Right Join
It returns all observations from the right dataset and the matched
observations from the left dataset.
merge(dt1, dt2, by="A", all.y = TRUE)
Full Join
It returns all rows from both datasets, whether or not there is a match in the other dataset.
merge(dt1, dt2, all=TRUE)
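The four join types are easiest to compare on tables with unique keys, where each result can be checked by eye. A minimal sketch, assuming two made-up tables left and right that share an id column:

```r
library(data.table)

# Hypothetical tables: ids 2 and 3 appear in both.
left  <- data.table(id = 1:3, x = c("a", "b", "c"))
right <- data.table(id = 2:4, y = c("B", "C", "D"))

inner      <- merge(left, right, by = "id")                 # ids 2, 3
left_join  <- merge(left, right, by = "id", all.x = TRUE)   # ids 1, 2, 3
right_join <- merge(left, right, by = "id", all.y = TRUE)   # ids 2, 3, 4
full       <- merge(left, right, by = "id", all = TRUE)     # ids 1, 2, 3, 4
```

Rows without a partner in the other table get NA in the columns that come from that table, for example y is NA for id 1 in left_join.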
Convert a data.table to data.frame
You can use setDF() function to accomplish this task.
setDF(mydata)
Similarly, you can use setDT() function to convert data frame to data table.
set.seed(123)
X = data.frame(A=sample(3, 10, TRUE),
B=sample(letters[1:3], 10, TRUE))
setDT(X, key = "A")
Other Useful Functions

Reshape Data
The package includes several useful functions which make data cleaning easy and
smooth. To reshape or transpose data, you can
use the dcast.data.table() and melt.data.table() functions. These functions
are adapted from the reshape2 package and made more efficient. They also add
some new features.
Rolling Joins
It supports rolling joins. They are commonly used for analyzing time series
data. Very few R packages support this kind of join.
Examples for Practise

Q1. Calculate total number of rows by month and then sort in descending order
mydata[, .N, by = month] [order(-N)]
The .N operator is used to find count.
Q2. Find top 3 months with high mean arrival delay
mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = month][order(-mean_arr_delay)][1:3]
Q3. Find origin of flights having average total delay greater than 20 minutes
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay",
"dep_delay"), by = origin][(arr_delay + dep_delay) > 20]
Q4. Extract average of arrival and departure delays for carrier == 'DL' by 'origin' and 'dest' variables
mydata[carrier == "DL",
lapply(.SD, mean, na.rm = TRUE),
by = .(origin, dest),
.SDcols = c("arr_delay", "dep_delay")]
Q5. Pull first value of 'air_time' by 'origin' and then sum the returned values when it is greater than 300
mydata[, .SD[1], .SDcols="air_time", by=origin][air_time > 300,
sum(air_time)]
Endnotes
This package provides a one-stop solution for data wrangling in R. It offers
two main benefits: less coding and lower computing time. However, it is not
the first choice of some R programmers. Some prefer the dplyr package for its
simplicity. I would recommend learning both packages. Check out the dplyr
tutorial. If you are working on data smaller than 1 GB, you can use the
dplyr package. It offers decent speed but is slower than the data.table package.
TRANSPOSE DATA IN R
In R, we can transpose our data very easily. There are many packages
such as tidyr and reshape2 that help make it easy. In this article, I
use the 'reshape2' package. This package was written by Hadley Wickham, one of the
most popular R experts.
Transpose Data with R
Let's go through some examples -
Sample Data
The code below would create a sample data that would be used for
demonstration.
data <- read.table(text="X Y Z
ID12 2012-06 566
ID1 2012-06 10239
ID6 2012-06 524
ID12 2012-07 2360
ID1 2012-07 13853
ID6 2012-07 2352
ID12 2012-08 3950
ID1 2012-08 14738
ID6 2012-08 4104",header=TRUE)
Convert Long to Wide Format
Suppose you have data containing three variables such as X, Y and Z. The
variable 'X' contains IDs and the variable 'Y' contains dates and the variable
'Z' contains income. The data is structured in a long format and you need to
convert it to wide format so that the dates become column names and the
income values appear under these dates. A snapshot of the data and the
desired output is shown below -
R : Convert Long to Wide Format
In the reshape2 package, there are two functions for transforming long-format
data to wide format: "dcast" and "acast". The only
difference between these two functions is the type of output :
1. dcast function returns a data frame as output.
2. acast function returns a vector, matrix or array as output.
Install reshape2 package if not installed already
if (!require(reshape2)){
install.packages('reshape2')
library(reshape2)
}
R Code : Transform Long to Wide Format

mydt = dcast(data,X~Y,value.var = "Z")
How dcast function works
1. The first parameter of dcast function refers to a data frame.
2. The left hand side of the casting formula refers to ID variables.
3. The right hand side refers to the variable to move to column names.
4. The value.var would contain the name of the variable that stores values.
Example 2 : More than 1 ID Variable
Let's see another example wherein we have more than 1 ID variable. It
contains information about Income generated from 2 products - Product A
and B reported semi-annually.
Example of Transforming Data
library(reshape2)
xx=dcast(data, Year + SemiYear ~ Product, value.var = "Income")
In the above code, "Year + SemiYear" are the 2 ID variables. We want
"Product" variable to be moved to columns.
The output is shown below -
Output
If you want the final output to be reported at year level
It seems to be VERY EASY (just remove the additional ID variable
'SemiYear'). But it's a little tricky. See the explanation below
dcast(data, Year ~ Product, value.var = "Income")
Warning : Aggregation function missing: defaulting to length
Year ProductA ProductB
The income values are incorrect in the above table, because each cell now holds the
number of records rather than the income.
We need to define the statistic used to aggregate income at year level. Let's sum the
income to report the annual total.
dcast(data, Year ~ Product, fun.aggregate = sum, value.var = "Income")
Year ProductA ProductB
27446
23176
22324
24881
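The effect of fun.aggregate can be reproduced on a small table. The dataset behind this example is only shown as an image, so the sketch below uses a made-up data frame income with the same column names; all values are invented.

```r
library(reshape2)

# Hypothetical semi-annual income data; all values are invented.
income <- data.frame(Year     = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015),
                     SemiYear = c(1, 1, 2, 2, 1, 1, 2, 2),
                     Product  = c("A", "B", "A", "B", "A", "B", "A", "B"),
                     Income   = c(10, 20, 30, 40, 50, 60, 70, 80))

# length counts the records in each Year x Product cell (what the default falls back to).
counts <- dcast(income, Year ~ Product, value.var = "Income", fun.aggregate = length)

# sum adds the two half-year incomes into one annual figure.
totals <- dcast(income, Year ~ Product, value.var = "Income", fun.aggregate = sum)
totals
```

Every cell of counts is 2 because each year has two half-year records per product, while totals holds the annual sums.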
Convert Wide Format Data to Long Format
Suppose you have data containing information of species and their sepal
length. The data of sepal length of species are in columns.
Wide to Long Format
Create Sample Data

mydata = read.table(text= "ID setosa versicolor virginica
1 5.1 NA NA
2 4.9 NA NA
3 NA 7 NA
4 NA 6.4 NA
5 NA NA 6.3
6 NA NA 5.8
", header=TRUE)
The following program would reshape data from wide to long format.
library(reshape2)
x = colnames(mydata[,-1])
t = melt(mydata,id.vars = "ID",measure.vars = x , variable.name="Species",
value.name="Sepal.Length",na.rm = TRUE)
How melt function works :
1. id.vars - ID variables to keep in the final output.
2. measure.vars - variables to be transformed
3. variable.name - name of variable used to store measured variable names
4. value.name - name of variable used to store values
LOOPS IN R
This tutorial explains how to write loops in R. It includes an explanation
of the APPLY family of functions and FOR LOOP with several examples
which make writing R loops easy.
Loops with R
What is Loop?
Loops help you repeat a similar operation on different variables, on
different columns or on different datasets. For example, you want to multiply
each variable by 5. Instead of multiplying each variable one by one, you can
perform this task in a loop. Its main benefit is to reduce duplication in
your code, which makes later changes easier.
Ways to Write Loop in R
1. For Loop
2. While Loop
3. Apply Family of Functions such as Apply, Lapply, Sapply etc
Apply Family of Functions
They are the hidden loops in R. They make loops easier to read and write.
But these concepts are newer to the programming world than the
For Loop and While Loop.
1. Apply Function
It is used when we want to apply a function to the rows or columns of a
matrix or data frame. It cannot be applied on lists or vectors.
apply arguments
Create a sample data set

dat <- data.frame(x = c(1:5,NA),
z = c(1, 1, 0, 0, NA,0),
y = 5*c(1:6))
Example 1 : Find Maximum value of each row

apply(dat, 1, max, na.rm= TRUE)
Output : 5 10 15 20 25 30
In the second parameter of apply function, 1 denotes the
function to be applied at row level.
Example 2 : Find Maximum value of each column
apply(dat, 2, max, na.rm= TRUE)
The output is shown in the table below
x  z  y
5  1  30
In the second parameter of apply function, 2 denotes the function
to be applied at column level.
2. Lapply Function
It is used when we want to apply a function to each element of a data structure and
get a list back.
lapply arguments
Example 1 : Calculate Median of each of the variables
lapply(dat, function(x) median(x, na.rm = TRUE))
The function(x) is used to define the function we want to apply.
The na.rm=TRUE is used to ignore missing values and median would now
be calculated on non-missing values.
Example 2 : Apply a custom function

lapply(dat, function(x) x + 1)
In this case, we are adding 1 to each variable. The final output is
a list, shown in the image below.
Output
3. Sapply Function
Sapply is a user friendly version of Lapply as it returns a vector when we
apply a function to each element of a data structure.
Example 1 : Number of Missing Values in each Variable
sapply(dat, function(x) sum(is.na(x)))
The above function returns 1,1,0 for variables x,z,y in data frame 'dat'.
Example 2 : Extract names of all numeric

variables in IRIS dataset
colnames(iris)[which(sapply(iris,is.numeric))]
In this example, sapply(iris,is.numeric) returns TRUE/FALSE against each
variable. If the variable is numeric, it would return TRUE otherwise FALSE.
Later, which function returns the column position of the numeric variables .
Try running only this portion of the
code which(sapply(iris,is.numeric)). Adding colnames function would
help to return the actual names of the numeric variables.
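The difference in return types can be checked directly. The sketch below reuses the sample data frame from the apply section; count_na is a hypothetical helper name, and vapply() is a base R function that is not used elsewhere in this tutorial. It is shown here as a stricter alternative to sapply.

```r
dat <- data.frame(x = c(1:5, NA),
                  z = c(1, 1, 0, 0, NA, 0),
                  y = 5 * c(1:6))

# Hypothetical helper: number of missing values in a vector
count_na <- function(v) sum(is.na(v))

as_list   <- lapply(dat, count_na)              # named list
as_vector <- sapply(dat, count_na)              # named integer vector
checked   <- vapply(dat, count_na, integer(1))  # errors if a result is not a single integer

class(as_list)     # "list"
class(as_vector)   # "integer"
```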
Lapply and Sapply Together

In this example, we would show you how both lapply and sapply are used
simultaneously to solve the problem.
Create a sample data

dat <- data.frame(x = c(1:5,NA),
z = c(1, 1, 0, 0, NA,0),
y = factor(5*c(1:6)))
Converting Factor Variables to Numeric

The following code would convert all the factor variables of data frame 'dat'
to numeric types variables.
index <- sapply(dat, is.factor)
dat[index] <- lapply(dat[index], function(x) as.numeric(as.character(x)))
Explanation :
1. index would return TRUE / FALSE whether the variable is factor or not
2. Converting only those variables wherein index=TRUE.
4. For Loop
Like the apply family of functions, a for loop is used to repeat the same task on
multiple data elements or datasets. It is similar to the FOR LOOP in other
languages such as VB, Python etc. This concept is not new and it has been in
the programming field for many years.
Example 1 : Maximum value of each column

x = NULL
for (i in 1:ncol(dat)){
x[i]= max(dat[i], na.rm = TRUE)}
x
Prior to starting a loop, we need to make sure we create an empty vector.
The empty vector is defined by x=NULL. Next step is to define the number of
columns for which loop over would be executed. It is done with ncol function.
The length function could also be used to know the number of column.
The above FOR LOOP program can be written like the code below.
x = vector("double", ncol(dat))
for (i in seq_along(dat)){
x[i]= max(dat[i], na.rm = TRUE)}
x
The vector function can be used to create an empty vector.
The seq_along finds out what to loop over.
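One reason to prefer seq_along over 1:ncol() shows up when there is nothing to loop over. The following is a minimal sketch; the empty data frame is a made-up edge case.

```r
empty <- data.frame()        # a data frame with zero columns

idx_colon <- 1:ncol(empty)   # 1 0 : a loop over this would run twice, wrongly
idx_along <- seq_along(empty)  # integer(0) : a loop over this does not run at all

runs <- 0
for (i in seq_along(empty)) runs <- runs + 1
runs                         # 0
```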
Example 2 : Split IRIS data based on unique values in "species" variable
for (i in 1:length(unique(iris$Species))) {
require(dplyr)
assign(paste("iris",i, sep = "."), filter(iris, Species ==
as.character(unique(iris$Species)[i])))
}
Combine Data within LOOP

In the example below, we are combining rows in iterative process.
Method 1 : Use do.call with rbind
do.call() applies a given function to the list as a whole. When it is used with
rbind, it would bind all the list arguments.
temp =list()
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
temp[[i]] =series
}
output = do.call(rbind, temp)
output
Method 2 : Use Standard Looping Technique

dummydt=data.frame(matrix(ncol=0,nrow=0))
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
if (i==1) {output = rbind(dummydt,series)} else {output =
rbind(output,series)}
}
output
If we need to wrap the above code in a function, we need to make some
changes in the code. For example, data$variable won't work inside the function
when the column name is passed as a string. Instead we should use data[[variable]].
See the code below.
dummydt=data.frame(matrix(ncol=0,nrow=0))
temp = function(data, var) {
for (i in 1:length(unique(data[[var]]))) {
series= data.frame(Species = as.character(unique(data[[var]]))[i])
if (i==1) {output = rbind(dummydt,series)} else {output =
rbind(output,series)}
}
return(output)}
temp(iris, "Species")
For Loop and Sapply Together

Suppose you are asked to impute Missing Values with Median in each of the
variables in a data frame. It becomes a daunting task if you don't know how
to write a loop. Otherwise, it's a straightforward task.
for (i in which(sapply(dat, is.numeric))) {
dat[is.na(dat[, i]), i] <- median(dat[, i], na.rm = TRUE)
}
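To see the effect, the loop can be run on a small data frame. This is a sketch under assumptions: the character column g is hypothetical and is added only to show that non-numeric columns are left untouched.

```r
dat <- data.frame(x = c(1:5, NA),
                  z = c(1, 1, 0, 0, NA, 0),
                  g = c("a", "b", "a", "b", "a", "b"))

# Loop over the positions of the numeric columns only
for (i in which(sapply(dat, is.numeric))) {
  dat[is.na(dat[, i]), i] <- median(dat[, i], na.rm = TRUE)
}

dat$x[6]   # 3, the median of 1 to 5
dat$z[5]   # 0, the median of 1, 1, 0, 0, 0
```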
5. While Loop in R
A while loop is more general than a for loop because you can rewrite any for
loop as a while loop but not vice-versa.
In the example below, we are checking whether a number is odd or even.
i=1
while(i<7)
{
if(i%%2==0)
print(paste(i, "is an Even number"))
else if(i%%2>0)
print(paste(i, "is an Odd number"))
i=i+1
}
The double percent sign (%%) is the modulo operator. Read i%%2 as mod(i,2). The
loop runs for i from 1 to 6 and stops as soon as the condition i<7 is no longer true.
Output:
[1] "1 is an Odd number"
[1] "2 is an Even number"
[1] "3 is an Odd number"
[1] "4 is an Even number"
[1] "5 is an Odd number"
[1] "6 is an Even number"
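The claim that any for loop can be rewritten as a while loop can be illustrated with a simple sum. This is a minimal sketch with made-up variable names.

```r
# for loop: the sequence 1:5 is fixed before the loop starts
total_for <- 0
for (i in 1:5) total_for <- total_for + i

# Equivalent while loop: the counter is initialised and advanced by hand
total_while <- 0
i <- 1
while (i <= 5) {
  total_while <- total_while + i
  i <- i + 1
}

total_for     # 15
total_while   # 15
```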
Loop Concepts : Break and Next

Break Keyword
When a loop encounters 'break' it stops the iteration and breaks out of loop.
for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
break } else {
print(paste("i=", i, "j=", j))
}
}
}
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
In this case, as soon as the condition i+j > 4 is met, break exits the inner loop.
The outer loop then moves on to the next value of i.
Next Keyword
When a loop encounters 'next', it terminates the current iteration and moves
to next iteration.
for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
next
} else {
print(paste("i=", i, "j=", j))
}
}
}
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
[1] "i= 2 j= 2"
[1] "i= 2 j= 1"
[1] "i= 3 j= 1"
If you get confused between 'break' and 'next', compare the output of both
and see the difference.
ERROR HANDLING IN R
In R, we can handle errors with try() and inherits(object-name,'try-error').

mtry <- try(tuneRF(dt[, -3], dat3[,3], ntreeTry=100, stepFactor=1.5,improve=0.01))
if (!inherits(mtry, "try-error")) {
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
rf <- randomForest(ID~.,data=dt, mtry=best.m, importance=TRUE, ntree=1000)
} else {
rf <- randomForest(ID~.,data=dt, importance=TRUE, ntree=1000)
}
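The same pattern can be tried without the randomForest package. The sketch below uses hypothetical helpers safe_log and safe_log2; tryCatch() is a base R alternative that is not part of the example above.

```r
# Hypothetical helper: returns log(x), or NA if log() signals an error
safe_log <- function(x) {
  res <- try(log(x), silent = TRUE)
  if (inherits(res, "try-error")) NA_real_ else res
}

ok  <- safe_log(1)     # 0
bad <- safe_log("a")   # NA, because log("a") raises an error

# The same idea with tryCatch(), which avoids the inherits() check
safe_log2 <- function(x) tryCatch(log(x), error = function(e) NA_real_)
bad2 <- safe_log2("a") # NA
```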
R : CONVERTING A FACTOR TO INTEGER

Many R programmers make a mistake while converting a factor variable to integer.
Let's create a factor variable

a <- factor(c(2, 4, 3, 3, 4))
str(a)
Incorrect Way
a1 = as.numeric(a)
str(a1)
as.numeric() returns the internal integer codes of the factor levels (here 1 3 2 2 3)
and not the original values.
Correct Way
a2 = as.numeric(as.character(a))
str(a2)
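The three conversions can be compared side by side. This is a short sketch; the third form, as.numeric(levels(a))[a], is an alternative idiom that is not used in the text above.

```r
a <- factor(c(2, 4, 3, 3, 4))

codes  <- as.numeric(a)                 # 1 3 2 2 3 : internal codes, not the values
values <- as.numeric(as.character(a))   # 2 4 3 3 4
viaLev <- as.numeric(levels(a))[a]      # 2 4 3 3 4 : converts each level only once
```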
R : CHARACTER FUNCTIONS
This tutorial lists some of the most useful character functions in R. It includes
concatenating two strings, extracting a portion of text from a string, extracting a word
from a string, making text uppercase or lowercase, replacing text with other text etc.
Character Functions in R
Basics
In R, strings are stored in a character vector. You can create strings with a single
quote / double quote.
For example, x = "I love R Programming"
1. Convert object into character type
The as.character function converts argument to character type. In the example
below, we are storing 25 as a character.
Y = as.character(25)
class(Y)
The class(Y) returns character as 25 is stored as a character in the previous line of
code.
2. Check the character type

To check whether a vector is a character or not, use is.character function.
x = "I love R Programming"
is.character(x)
Output : TRUE
Like is.character function, there are other functions such as is.numeric, is.integer
and is.array for checking numeric vector, integer and array.
3. Concatenate Strings
The paste function is used to join two strings. It is one of the most important
string manipulation tasks. Every analyst performs it almost daily to structure data.
Paste Function Syntax

paste (objects, sep = " ", collapse = NULL)
The sep= keyword denotes a separator or delimiter placed between the objects. The
default separator is a single space. The collapse= keyword joins the elements of the
result into a single string, separated by the given value.
Example 1
x = "Deepanshu"
y ="Bhalla"
paste(x, y)
Output : Deepanshu Bhalla
paste(x, y, sep = ",")
Output : Deepanshu,Bhalla
Example 2 : To create column names from x1 through x10
paste("x", seq(1,10), sep = "")
Output : "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10"
Example 3 : Use of 'Collapse' keyword

paste("x", seq(1,10), sep="", collapse=",")
Output : "x1,x2,x3,x4,x5,x6,x7,x8,x9,x10"
Compare the output of Example 2 and Example3, you would understand the usage
of collapse keyword in paste function. Every sequence of x is separated by ",".
4. String Formatting
Suppose the value is stored in fraction and you need to convert it to percent.
The sprintf is used to perform C-style string formatting.
Sprintf Function Syntax

sprintf(fmt, ...)
The keyword fmt denotes string format. The format starts with the symbol %
followed by numbers and letters.
x = 0.25
sprintf("%.0f%%",x*100)
Output : 25%
Note : '%.0f' indicates 'fixed point' decimal notation with 0 decimal. The
extra % sign after 'f' tells R to add percentage sign after the number.
If you change the code to sprintf("%.2f%%",x*100), it would return 25.00%.
Other Examples
a = seq(1, 5)
sprintf("x%03d", a)
Output : "x001" "x002" "x003" "x004" "x005"
The letter 'd' in the format is used for integer values.
sprintf("%s has %d rupees", "Ram", 500)
Output : "Ram has 500 rupees"
The letter 's' in the format is used for character string.
5. Extract or replace substrings
substr Syntax - substr(x, starting position, end position)
x = "abcdef"
substr(x, 1, 3)
Output : abc
In the above example, we are telling R to extract the string from the 1st letter
through the 3rd letter.
Replace Substring - substr(x, starting position, end position) = Value
substr(x, 1, 2) = "11"
Output : 11cdef
In the above example, we are telling R to replace first 2 letters with 11.
6. String Length
The nchar function is used to compute the length of a character value.

x = "I love R Programming"
nchar(x)
Output : 20
It returns 20 as the string 'x' contains 20 characters (including 3 spaces).
7. Replace the first match of the string
sub Syntax -
sub(sub-string, replacement, x, ignore.case = FALSE)
If ignore.case is FALSE, the pattern matching is case sensitive; if TRUE, case is
ignored during matching.
sub("okay", "fine", "She is okay.")
Output : She is fine.
In the above example, we are replacing the word 'okay' with 'fine'.
Let's replace all values of a vector

In the example below, we need to replace prefix 'x' with 'Year' in values of a vector.
cols = c("x1", "x2", "x3")
sub("x", "Year", cols)
Output : "Year1" "Year2" "Year3"
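Since sub() only touches the first match in each string, it helps to compare it with gsub(), its base R counterpart that replaces every match. The string s below is made up for illustration.

```r
s <- "a-b-c"

first_only <- sub("-", "+", s)    # "a+b-c" : only the first match is replaced
all_of_them <- gsub("-", "+", s)  # "a+b+c" : every match is replaced
```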
8. Extract Word from a String
Suppose you need to pull a first or last word from a character string.
Word Function Syntax (Library : stringr)

word(string, position of word to extract, separator)
Example
library(stringr)
word(x, 1,sep = " ")
Output : I
In the example above, '1' denotes the first word to be extracted from a string. sep=" "
denotes a single space as a delimiter (It's the default delimiter in the word function)
Extract Last Word

library(stringr)
word(x, -1,sep = " ")
Output : Programming
In the example above, '-1' denotes the first word counted from the right of the
string, i.e. the last word. sep=" " denotes a single space as a delimiter (It's the
default delimiter in the word function)
9. Convert Character to Uppercase / Lowercase / Propercase
Many times, we need to change the case of a word, for example, convert it to
uppercase or lowercase.
Examples
tolower(x)
Output : "i love r programming"
The tolower() function converts letters in a string to lowercase.
toupper(x)
Output : "I LOVE R PROGRAMMING"
The toupper() function converts letters in a string to uppercase.
library(stringr)
str_to_title(x)
Output : "I Love R Programming"
The str_to_title() function converts the first letter of each word in a string to
uppercase and the remaining letters to lowercase.
10. Remove Leading and Trailing Spaces
The trimws() function is used to remove leading and/or trailing spaces.
Syntax :
trimws(x, which = c("both", "left", "right"))
Default Option : both : It implies removing both leading and trailing whitespace.
If you want to remove only leading spaces, you can specify "left". For removing
trailing spaces,specify "right".
a = " Deepanshu Bhalla "

trimws(a)
It returns "Deepanshu Bhalla".
The str_trim() function from the stringr package eliminates leading and trailing
spaces.
x= " deepanshu bhalla "
library(stringr)
str_trim(x)
Output : "deepanshu bhalla"
11. Converting Multiple Spaces to a Single Space
It's a challenging task to remove multiple spaces from a string and keep only a
single space. In R, it is possible to do it easily with qdap package.
x= "deepanshu     bhalla"
library(qdap)
Trim(clean(x))
Output : deepanshu bhalla
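The same result can also be reached without an extra package. The following is a minimal sketch using base R's trimws() and gsub(); the input string is made up and this is an alternative to, not a description of, the qdap approach.

```r
x <- "  deepanshu     bhalla  "

# trimws() drops leading/trailing whitespace; gsub() squeezes inner runs to one space
single_spaced <- gsub("\\s+", " ", trimws(x))
single_spaced   # "deepanshu bhalla"
```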
12. Repeat the character N times

In case you need to repeat the character number of times, you can do it with strrep
base R function.
strrep("x",3)
Output : "xxx"
13. Find String in a Character Variable
The str_detect() function helps to check whether a sub-string exists in a string. It
is equivalent to 'contain' function of SAS. It returns TRUE/FALSE against each value.
x = c("Aon Hewitt", "Aon Risk", "Hewitt", "Google")
library(stringr)
str_detect(x,"Aon")
Output : TRUE TRUE FALSE FALSE
14. Splitting a Character Vector

In text mining, it is often required to split a string to calculate the most frequently
used keywords in the list. There is a function called 'strsplit()' in base R to perform
this operation. It returns a list with one element per input string.
x = c("I love R Programming")
strsplit(x, " ")
Output : "I" "love" "R" "Programming"
15. Selecting Multiple Values

The %in% keyword is used to select multiple values. It is the same function as IN
keyword in SAS and SQL.
x = sample(LETTERS,100, replace = TRUE)

x[x %in% c("A","B","C")]
In the example above, we are generating a sample of alphabets and later we are
subsetting data and selecting only A B and C.
16. Pattern Matching

Most of the time, string manipulation becomes a daunting task as we need to
match the pattern in strings. In these cases, regular expressions (regex) are a popular
way to check the pattern. In R, they can be used with the grepl function.
Example
x = c("Deepanshu", "Dave", "Sandy", "drahim", "Jades")
1. Keeping characters that start with the letter 'D'
x[grepl("^D",x)]
Output : "Deepanshu" "Dave"
Note : It does not return 'drahim' as pattern mentioned above is case-sensitive.
To make it case-insensitive, we can add (?i) before ^D.
x[grepl("(?i)^d",x)]
2. Keeping characters that do not start with the letter 'D'
x[!grepl("(?i)^d",x)]
Output : "Sandy" "Jades"
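Case-insensitive matching can also be requested through grepl()'s ignore.case argument instead of the (?i) prefix. A short sketch on the same vector; the names starts_d and not_d are made up.

```r
x <- c("Deepanshu", "Dave", "Sandy", "drahim", "Jades")

starts_d <- x[grepl("^d", x, ignore.case = TRUE)]
starts_d   # "Deepanshu" "Dave" "drahim"

not_d <- x[!grepl("^d", x, ignore.case = TRUE)]
not_d      # "Sandy" "Jades"
```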
3. Keeping characters that end with 's'

x[grepl("s$",x)]
Output : "Jades"
4. Keeping characters that contain 'S' (ignoring case)

x[grepl("(?i)s",x)]
Output : "Deepanshu" "Sandy" "Jades"
R : APPLY FUNCTION ON ROWS

This tutorial explains how to apply functions on rows.
Sample Data
X   Y   Z
6   5   0
6   3   NA
6   1   5
8   5   3
1   NA  1
8   7   2
Apply Function
The apply function is used when we want to apply a function to the rows or columns
of a matrix or data frame. It cannot be applied on lists or vectors.
apply arguments
Calculate maximum value across row

apply(data, 1, max)
It returns NA if NAs exist in a row. To ignore NAs, you can use the following line of
code.
apply(data, 1, max, na.rm = TRUE)
Calculate mean value across row

apply(data, 1, mean)
apply(data, 1, mean, na.rm = TRUE)
Calculate number of 0s in each row

apply(data == 0, 1, sum, na.rm= TRUE)
Calculate number of values greater than 5 in each row
apply(data > 5, 1, sum, na.rm= TRUE)
Select all rows having mean value greater than or equal to 4
df = data[apply(data, 1, mean, na.rm = TRUE)>=4,]
Remove rows having NAs

helper = apply(data, 1, function(x){any(is.na(x))})
df2 = data[!helper,]
It can be easily done with df2 = na.omit(data).
Count unique values across row

df3 = apply(data,1, function(x) length(unique(na.omit(x))))
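For sums and means, base R also provides the vectorised rowSums() and rowMeans(), which give the same numbers as the corresponding apply() calls. The sketch below assumes the sample data shown at the top of this section, typed in by hand.

```r
# Sample data, reconstructed by hand (an assumption about the original table)
data <- data.frame(X = c(6, 6, 6, 8, 1, 8),
                   Y = c(5, 3, 1, 5, NA, 7),
                   Z = c(0, NA, 5, 3, 1, 2))

row_means <- rowMeans(data, na.rm = TRUE)       # same as apply(data, 1, mean, na.rm = TRUE)
n_zero    <- rowSums(data == 0, na.rm = TRUE)   # number of 0s in each row
n_gt5     <- rowSums(data > 5, na.rm = TRUE)    # number of values greater than 5 in each row

unname(n_zero)   # 1 0 0 0 0 0
unname(n_gt5)    # 1 1 1 1 0 2
```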
R : KEEP / DROP COLUMNS FROM DATA FRAME

Deepanshu Bhalla
The article below explains how to keep or remove variables (columns) from data
frame. In R, there are multiple ways to select or drop variables.
Create a sample data frame

The following code creates a sample data frame that is used for demonstration.
set.seed(456)
mydata <- data.frame(a=letters[1:5], x=runif(5,10,50), y=sample(5), z=rnorm(5))
Sample Data
Drop columns by their names

Method I :
The easiest way to drop columns is by using subset() function. In the code
below, we are telling R to drop variables x and z. The '-' sign indicates dropping
variables. Make sure the variable names are NOT specified in quotes when
using subset() function.
df = subset(mydata, select = -c(x,z) )
Method II :
In this method, we are creating a character vector named drop in which we are
storing column names x and z. Later we are telling R to select all the variables
except the column names specified in the vector drop. The function names() returns
all the column names and the '!' sign indicates negation.
drop <- c("x","z")
df = mydata[,!(names(mydata) %in% drop)]
It can also be written like : df = mydata[,!(names(mydata) %in% c("x","z"))]
Drop columns by column index numbers
It's easier to remove variables by their position number. All you need to do is
mention the column index number. In the following code, we are telling R to drop
variables that are positioned at the first, third and fourth columns.
df <- mydata[ -c(1,3:4) ]
Keep columns by their names

Method I :
In this section, we are retaining variables x and z.
keeps <- c("x","z")
df = mydata[keeps]
The above code is equivalent to df = mydata[c("x","z")]
Method II :
We can keep variables with subset() function.
df = subset(mydata, select = c(x,z) )
Related Content : Select columns by their name pattern
R Function : Keep / Drop Column Function
The following program automates keeping or dropping columns from a data frame.
KeepDrop = function(data=df,cols="var",newdata=df2,drop=1) {
# Double Quote Output Dataset Name
t = deparse(substitute(newdata))
# Drop Columns
if(drop == 1){
newdata = data [ , !(names(data) %in% scan(textConnection(cols), what="",
sep=" "))]}
# Keep Columns
else {
newdata = data [ , names(data) %in% scan(textConnection(cols), what="",
sep=" ")]}
assign(t, newdata, .GlobalEnv)
}
How to use the above function

To keep variables 'a' and 'x', use the code below. The drop = 0 implies
keeping variables that are specified in the parameter "cols". The
parameter "data" refers to input data frame. "cols" refers to the variables you want
to keep / remove. "newdata" refers to the output data frame.
KeepDrop(data=mydata,cols="a x", newdata=dt, drop=0)

To drop variables, use the code below. The drop = 1 implies removing
variables which are defined in the second parameter of the function.
KeepDrop(data=mydata,cols="a x", newdata=dt, drop=1)
R : KEEP / DROP COLUMNS BY THEIR NAME PATTERN

This article describes how to keep or drop columns by their name pattern. With
regular expressions, we can easily keep or drop columns whose names contain a
special keyword or a pattern. It is very useful when we have a large number of variables
and we need to select only those columns having the same pattern.
Let's create a sample data frame

The code below creates data for 4 variables named as follows :
INC_A SAC_A INC_B ASD_A
mydata = read.table(text="
INC_A SAC_A INC_B ASD_A
2 1 5 12
3 4 2 13
", header=TRUE)
Keep / Drop Columns by pattern
Keeping columns whose name starts with "INC"
mydata1 = mydata[,grepl("^INC",names(mydata))]
The grepl() function is used to search for matches to a pattern. In this case, it is
searching "INC" at starting in the column names of data frame mydata. It returns
INC_A and INC_B.
Dropping columns whose name starts with "INC"
The '!' sign indicates negation. It returns SAC_A and ASD_A.
mydata2 = mydata[,!grepl("^INC",names(mydata))]
Keeping columns whose name contain "_A" at the end
The "$" is used to search for the sub-strings at the end of string. It returns INC_A,
SAC_A and ASD_A.
mydata12 = mydata[,grepl("_A$",names(mydata))]
Dropping columns whose name contain "_A" at the end
mydata22 = mydata[,!grepl("_A$",names(mydata))]
Keeping columns whose name contain the letter "S"
mydata32 = mydata[,grepl("S",names(mydata))]
Dropping columns whose name contain the letter "S"
mydata33 = mydata[,!grepl("S",names(mydata))]
JOINING AND MERGING IN R

This tutorial explains how we can join (merge) two tables in R.
Let's create two tables
Table I : DF1
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
w = c('a', 'b', 'c', 'd', 'e'),
x = c(1, 1, 0, 0, 1),
y=rnorm(5),
z=letters[1:5])
ID  w  x  y             z
1   a  1  -1.250974459  a
2   b  1  1.234389053   b
3   c  0  0.796469701   c
4   d  0  -0.004735964  d
5   e  1  -0.729994828  e
Table II : DF2
df2 <- data.frame(ID = c(1, 7, 3, 6, 8),
a = c('z', 'b', 'k', 'd', 'l'),
b = c(1, 2, 3, 0, 4),
c =rnorm(5),
d =letters[2:6])
ID  a  b  c           d
1   z  1  0.9367346   b
7   b  2  -2.3464766  c
3   k  3  0.8259913   d
6   d  0  -0.8663029  e
8   l  4  -0.482444   f
Inner Join
df3 = merge(df1, df2, by ="ID")
If the primary key (matching variable) does not have the same name in both the tables
(data frames), use by.x and by.y :
df3 = merge(df1, df2, by.x ="ID", by.y="ID")
Left Join
df4 = merge(df1, df2, by ="ID", all.x = TRUE)
Right Join
df5 = merge(df1, df2, by ="ID", all.y = TRUE)
Full (Outer) Join

df6 = merge(df1, df2, by ="ID", all = TRUE)
Cross Join
df7 = merge(df1, df2, by = NULL)
With SQL Joins

library(sqldf)
df9 = sqldf('select df1.*, df2.* from df1 left join df2 on df1.ID = df2.ID')
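Counting the rows of each result is a quick way to check which join was performed. The sketch below uses trimmed-down versions of df1 and df2 (only the ID column and one other column), so the remaining columns are left out on purpose.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5), w = c("a", "b", "c", "d", "e"))
df2 <- data.frame(ID = c(1, 7, 3, 6, 8), a = c("z", "b", "k", "d", "l"))

n_inner <- nrow(merge(df1, df2, by = "ID"))                # 2  : IDs 1 and 3 only
n_left  <- nrow(merge(df1, df2, by = "ID", all.x = TRUE))  # 5  : every row of df1
n_right <- nrow(merge(df1, df2, by = "ID", all.y = TRUE))  # 5  : every row of df2
n_full  <- nrow(merge(df1, df2, by = "ID", all = TRUE))    # 8  : 5 + 5 - 2 shared IDs
n_cross <- nrow(merge(df1, df2, by = NULL))                # 25 : 5 x 5 combinations
```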
R : SUMMARIZE DATA
Create a sample data

set.seed(1)
data <- data.frame(X = paste("s", sample(1:3, 15, replace = TRUE), sep = ""),Y =
ceiling(rnorm(15)), Z = rnorm(15), A = rnorm(15), B = rnorm(15))
Sample Data
Calculate Mean of Z by grouping variable X

dat1 = aggregate(Z ~ X, data=data, FUN=mean)
Calculate Mean of Z by 2 grouping variables

dat2 = aggregate(Z~ X + Y, data=data, FUN=mean)
Calculate Mean of Y and Z by grouping variable X
dat3 = aggregate(cbind(Y,Z)~X, data=data, FUN=mean)
Calculate Mean of all the variables by grouping variable X
dat4 = aggregate(.~X, data=data, FUN=mean)
Concatenate Text Based on Criteria
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,25,44,55,77,99) )
aggregate(v2 ~ v1, data = testDF, FUN=paste, collapse=",")
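Note that paste() only merges the values of one vector into a single string when collapse is supplied; sep is placed between separate arguments. A small sketch with made-up data:

```r
# Hypothetical data: v1 = 3 occurs twice, so its v2 values get concatenated
small <- data.frame(v1 = c(1, 3, 3, 5),
                    v2 = c(11, 33, 34, 55))

out <- aggregate(v2 ~ v1, data = small, FUN = paste, collapse = ",")
out$v2   # "11" "33,34" "55"
```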
HOW TO USE INDEXING OPERATORS IN LIST IN R

R has three main indexing operators. They are as follows :

1. [ ] = always returns a list, even when a single element is selected
2. [[ ]] = returns a single list element
3. $ = returns a list element by its name; elements are not necessarily of the
same class
Examples
dat <- list( str='R', vec=c(1,2,3), bool=TRUE )
a = dat["str"]
a
class(a)
b = dat[["str"]]
b
class(b)
c = dat$str
c
class(c)
R Indexing Operators
Important Note
Both $ and [[ ]] work the same way. But it is advisable to use [[ ]] in functions and loops.
How to extract a list of list

dat1[[c("Bal02","ivtable")]]
dat1$Bal02$ivtable
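The object dat1 is not defined in this tutorial, so here is a self-contained sketch with a hypothetical nested list that mirrors the names used above.

```r
# Hypothetical nested list: a list whose element Bal02 is itself a list
dat1 <- list(Bal02 = list(ivtable = data.frame(bin = 1:2), iv = 0.3))

a  <- dat1[[c("Bal02", "ivtable")]]    # recursive indexing with a character vector
b  <- dat1$Bal02$ivtable               # chained $
c2 <- dat1[["Bal02"]][["ivtable"]]     # chained [[ ]], convenient inside functions

identical(a, b)   # TRUE
```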
SPLIT A DATA FRAME

This tutorial explains how to split a data frame with R programming.
Sample Data
Create Sample Data
data <- read.table(text="X Y Z

ID12 2012-06 566
ID1 2012-06 10239
ID6 2012-06 524
ID12 2012-07 2360
ID1 2012-07 13853
ID6 2012-07 2352
ID12 2012-08 3950
ID1 2012-08 14738
ID6 2012-08 4104",header=TRUE)
Split a data frame

mydt2 = split(data, data$Y)
Get first list element

mydt2[[1]]
Calculate mean on each list element

sapply(mydt2 , function(x) mean(x$Z))
Split a list into multiple data frames

for(i in 1:length(mydt2)) {
assign(paste0("t.", i), mydt2[[i]])
}
R : CONVERT DATA FROM WIDE TO LONG FORMAT

This tutorial explains how to convert data from wide to long format with R
programming.
R Code : Convert Data from Wide to Long
Note : Before running the code below, install the reshape2 package if not installed already.
R Code
The following program would create a sample data for demonstration.
df = read.table(text= "ID Product1 Product2 Product3 Product4
1 1 NA 1 1
2 1 1 NA 1
3 1 1 NA NA
4 1 1 1 1", header=TRUE)
The following code would turn our data from wide to long format.
library(reshape2)
x = colnames(df[,-1])
t2 <- melt(df,id.vars = "ID",measure.vars = x , variable.name="Product",
value.name="value",na.rm = TRUE)
t2 = t2[order(t2$ID),]
Explanation :
1. id.vars - additional variables to keep in the output.
2. measure.vars - variables to be reshaped
3. variable.name - name of variable used to store measured variable names
4. value.name - name of variable used to store values
Note : If you do not want to remove NA values, change na.rm = TRUE to na.rm =
FALSE.
R WHICH FUNCTION EXPLAINED

In R, the which() function gives you the position of elements of a logical vector
that are TRUE.
Examples
1. which(letters=="z") returns 26.
Create a simple data frame

ls = data.frame( x1 = ceiling(runif(10)*10),
x2 = ceiling(runif(10)*10),
x3 = runif(10),
x4= rep(letters[1:5],2))
Sample Data Frame
2. Column number of variable "x4" in ls data set

i=which(names(ls)== "x4")
3. Row number in which maximum value of variable "x1" exists
which(ls$x1 == max(ls$x1))
4. Row number in which conditions hold true

which(ls$x1 == 7 & ls$x2 == 4)
5. Number of cases in which variable x1 is equal to variable x2
length(which(ls$x1 == ls$x2))
6. Which value is common in both the variables

ls[which(ls$x1 == ls$x2),"x1"]
7. Extract names of all the numeric variables

check = which(sapply(ls, is.numeric))
colnames(ls)[check]
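When the maximum occurs more than once, which() and which.max() give different answers. A short sketch with a made-up vector; which.max() is a base R function not used in the examples above.

```r
v <- c(3, 9, 4, 9)

all_max   <- which(v == max(v))   # 2 4 : every position of the maximum
first_max <- which.max(v)         # 2   : only the first position
```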
HOW TO UPDATE R SOFTWARE
R software can be easily updated with "installr" package.

install.packages('installr')
library("installr")
updateR()
A dialog box will open to take you through the following steps :
1. It checks for a newer version of R.
2. If one exists, the function will download the most updated R version and run
its installer.
3. Once done, the function will offer to copy (or move) all of the packages from
the old R library to the new R library.
4. It will then offer to update the moved packages, offer to open the new Rgui,
and lastly, it will quit the old R.
CONVERT BACKSLASH FILE PATH TO FORWARD SLASH IN R
In R, a file path must use forward slashes (or doubled backslashes). In Windows, file
paths use backslashes, and converting them to forward slashes by hand is tedious.
R Code : Converting backslash file path to forward slash
FSlash<- function(path = "clipboard") {
y <- if (path == "clipboard") {
readClipboard()
} else {
cat("Please enter the path:\n\n")
readline()
}
x <- chartr("\\", "/", y)
writeClipboard(x)
return(x)
}
Step I : Run the above code (once per session; ignore if already run)
Step II : Copy path of your file
Step III : Run FSlash()
Step IV : Press CTRL V to paste the converted file path
SEND EMAIL FROM R

In R, there is a package named mailR that allows you to send emails from R.
R Code : Send Email from R

library(mailR)
send.mail(from="sandy.david@gmail.com",
to="deepanshu.bhalla@outlook.com",
subject="Test Email",
body="PFA the desired document",
html=T,
smtp=list(host.name = "smtp.gmail.com",
port = 465,
user.name = "sandy.david@gmail.com",
passwd = "xxxxxxxxx",
ssl = T),
authenticate=T,
attach.files="C:\\Users\\Deepanshu\\Downloads\\Nature of Expenses.xls")
You can add multiple recipients by including the following code.
to = c("Recipient 1 <recipient1@gmail.com>", "recipient2@gmail.com"),
cc = c("CC Recipient <cc.recipient@gmail.com>"),
bcc = c("BCC Recipient <bcc.recipient@gmail.com>")
RUN SQL QUERIES IN R

This tutorial explains how to run sql queries in R with sqldf package.
Install and Load Package

install.packages("sqldf")
library(sqldf)
Create sample data

dt <- data.frame( ID = c('X1','X2','X4','X2','X1','X4','X3','X2','X1','X3'),
Value = c(4,3,1,3,4,6,6,1,8,4))
Example 1 : Select first 3 rows

x = sqldf("select * from dt limit 3")
Example 2 : Handle dot (.) in Column and Table names
Put the names in double quotes
test <- data.frame( x.1 = 1:10 )
sqldf( 'SELECT "x.1" FROM test' )
test.2 = data.frame(x= sample(10))
sqldf( 'SELECT * FROM "test.2" ' )
Example 3 : Subset rows
x2 = sqldf("select * from dt where Value >= 4")
Example 4 : Concatenate two data frames
x3 = sqldf("select * from x union all select * from x2")
Example 5 : Create a new variable

x4 = sqldf("select *, value*2 as newval from dt ")
Example 6 : Merge with another table
dt2 <- data.frame( ID = c('A1','A2','A4','A2','A1','A4','A3','A2','A1','A3'),
ColID = c('Saving',
'Current',
'Loan',
'Current',
'Saving',
'Loan',
'Mortgage',
'Current',
'Saving',
'Mortgage'))
x5 = sqldf("select a.*,b.ColID from dt a left join (select distinct ID, ColID from dt2) b
on a.ID = b.ID")
Example 7 : Working with Dates

library(RH2)
test1 <- data.frame(sale_date = as.Date(c("2008-08-01", "2031-01-09","1990-01-03")))
as.numeric(test1[[1]])
sqldf("select MAX(sale_date) from test1")
Example 8 : Cumulative Sum

library(RPostgreSQL)
# Upper case is folded to lower case by default so surround ID with double quotes
# The query itself is wrapped in single quotes so that "ID" does not end the R string
x6 = sqldf('select *, sum(Value) over (partition by "ID" order by Value) colsum from
dt ')
Example 9 : Ranking within Group

library(RPostgreSQL)
# Upper case is folded to lower case by default so surround ID with double quotes
# The query itself is wrapped in single quotes so that "ID" does not end the R string
x7 = sqldf('select *, rank() over (partition by "ID" order by Value) rank from dt')
MEASURING RUNNING TIME OF R CODE

To measure execution time of R code, we can use Sys.time function. Put it before
and after the code and take difference of it to get the execution time of code.
start.time <- Sys.time()
#Selecting Optimum MTRY parameter

mtry <- tuneRF(dev[, -1], dev[,1], ntreeTry=500, stepFactor=1.5,improve=0.05,
trace=TRUE, plot=TRUE)
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
#Train Random Forest

rf <-randomForest(classe~.,data=dat3, mtry=best.m, importance=TRUE,ntree=50)
end.time <- Sys.time()

time.taken <- round(end.time - start.time,2)
time.taken
Result : Time difference of 30.15 secs
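Base R also offers system.time(), which wraps the timing around a single expression. The sketch below times a made-up loop rather than the random forest code above; the elapsed value will differ from machine to machine.

```r
timing <- system.time({
  s <- 0
  for (i in 1:100000) s <- s + i   # placeholder workload
})

timing["elapsed"]   # elapsed wall-clock time in seconds
```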
R : INSTALL AN ARCHIVED PACKAGE

In R, you can install an archived package by following the steps below.
Step I : Download and Install Rtools

Link - http://cran.r-project.org/bin/windows/Rtools/
Step II : Install devtools package
Step III : Specify archive location link below

url <- "http://cran.r-project.org/src/contrib/Archive/DWD/DWD_0.11.tar.gz"
pkgFile <- "DWD_0.11.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs=pkgFile, type="source", repos=NULL)
R : DELETING COLUMNS WHERE CERTAIN % OF MISSING VALUES

In R, we can calculate % of missing values with colMeans(is.na(test)), then set a
logical condition on it and use the resulting logical values to delete columns.
final <- test[, colMeans(is.na(test)) <= .5]
Note : test is a data frame
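A tiny worked example makes the threshold visible. The data frame below is made up so that the columns have 75%, 25% and 0% missing values.

```r
test <- data.frame(a = c(1, NA, NA, NA),         # 75% missing
                   b = c(1, 2, NA, 4),           # 25% missing
                   c = c("x", "y", "z", "w"))    # 0% missing

pct_missing <- colMeans(is.na(test))             # a 0.75, b 0.25, c 0.00
final <- test[, pct_missing <= .5]               # keep columns with at most 50% missing

names(final)   # "b" "c"
```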
R : CONVERTING MULTIPLE NUMERIC VARIABLES TO FACTOR

In R, you can convert multiple numeric variables to factor using lapply function.
The lapply function is a part of apply family of functions. They perform multiple
iterations (loops) in R. In R, categorical variables need to be set as factor variables.
Some of the numeric variables which are categorical in nature need to be
transformed to factor so that R treats them as a grouping variable.
Converting Numeric Variables to Factor

1. Using Column Index Numbers
In this case, we are converting first, second, third and fifth numeric variables to
factor variables. mydata is a data frame.
names <- c(1:3,5)
mydata[,names] <- lapply(mydata[,names] , factor)
str(mydata)
2. Using Column Names
In this case, we are converting two variables 'Credit' and 'Balance' to factor
variables.
names <- c('Credit' ,'Balance')
mydata[,names] <- lapply(mydata[,names] , factor)
str(mydata)
3. Converting all variables
col_names <- names(mydata)

mydata[,col_names] <- lapply(mydata[,col_names] , factor)
4. Converting all numeric variables
mydata[sapply(mydata, is.numeric)] <- lapply(mydata[sapply(mydata, is.numeric)],
as.factor)
5. Checking unique values in a variable and convert to factor only those
variables having unique count less than 4
col_names <- sapply(mydata, function(col) length(unique(col)) < 4)
mydata[ , col_names] <- lapply(mydata[ , col_names] , factor)
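One caveat: when only a single column is selected, mydata[ , col_names] returns a plain vector rather than a data frame, and lapply would then iterate over its values. A possible way around this, sketched on a made-up data frame, is list-style indexing mydata[few], which keeps the data frame structure.

```r
# Hypothetical data: only Credit has fewer than 4 unique values
mydata <- data.frame(Credit  = c(1, 2, 1, 2, 1),
                     Balance = c(10.5, 20.1, 30.7, 40.2, 50.9))

few <- sapply(mydata, function(col) length(unique(col)) < 4)
mydata[few] <- lapply(mydata[few], factor)   # mydata[few] stays a data frame

sapply(mydata, class)   # Credit "factor", Balance "numeric"
```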
R : EXTRACTING NUMERIC AND FACTOR VARIABLES

In R, you can extract numeric and factor variables using sapply function.
Extracting Numeric Variables

# Extracting Numeric Variables
cols <- sapply(mydata, is.numeric)
abc = mydata [,cols]
Note : mydata is a dataframe
Extracting Factor Variables

# Extracting Factor Variables
cols <- sapply(mydata, is.factor)
abc = mydata [,cols]
Note : mydata is a dataframe
INSTALL R PACKAGE DIRECTLY FROM GITHUB

In R, you can install packages directly from GitHub with 2-3 simple lines of code.
Step I : Install and load devtools package

install.packages("devtools")
library(devtools)
Step II : Install Package from GitHub

install_github("tomasgreif/woe")
Possible Errors while installing package from GitHub
I. Error : Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called stringi
Solution : Run install.packages("stringi") before running the install_github command
II. Error : Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was
reached
Solution : Step 1. Update configuration for your proxy in the code below -
library(httr)
set_config(use_proxy(url="proxy.xxxx.com", port=80,
username="user",password="password"))
How to find proxy server settings :
Start > Control Panel > Network and Internet > Internet Options > Connections tab >
LAN settings (at the bottom) > check the Proxy Server settings
Step 2. Run this command - install_github("tomasgreif/woe")
Important Note : If you want to build an R package, you need to install Rtools along
with the devtools package. Rtools is not a package but an executable file.
You can download Rtools from this link - https://cran.r-project.org/bin/windows/Rtools/ or
install it from within R :
install.packages('installr')
library(installr)
install.Rtools()
CREATE PASSWORD GENERATOR APP WITH R

This tutorial explains how to create password generator utility with R.
R Function - Password Generator

Features
1. Passwords are drawn from digits, special characters, lowercase letters and
uppercase letters. Sampling is random, so a single password is not guaranteed to
contain every character type.
2. Flexibility to define the length of a password
3. Flexibility to specify the number of passwords you want to generate
password.generator <- function(len, n){
dummydt=data.frame(matrix(ncol=0,nrow=n))
num <- 1:9
spcl <- c("!", "#", "$", "%", "&", "(", ")", "*", "+", "-", "/", ":",
";", "<", "=", ">", "?", "@", "[", "^", "_", "{", "|", "}", "~")
comb <- c(num, spcl, letters, LETTERS)
p <- c(rep(0.035, 9), rep(0.015, 25), rep(0.025, 52))

password<-replicate(nrow(dummydt),paste0(sample(comb, len, TRUE, prob = p),
collapse = ""))
dummydt$password<-password
return(dummydt)
}
PasswrdFile = password.generator(len = 8, n = 100)
Parameters
1. len - length of a password
2. n - number of passwords you want to generate
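If every password must contain at least one digit, one special character, one lowercase and one uppercase letter, the sampling can be done per character class. The sketch below is one possible variant, not the function above: the name password_strict, the shortened special-character set and the requirement len >= 4 are assumptions made for illustration.

```r
# Sketch: force at least one character from each class (assumes len >= 4)
password_strict <- function(len, n) {
  stopifnot(len >= 4, n >= 1)
  digits <- as.character(1:9)
  spcl   <- c("!", "#", "$", "%", "&", "*", "+", "-", "?", "@")
  pools  <- list(digits, spcl, letters, LETTERS)
  all_chars <- unlist(pools)

  one_password <- function() {
    # One mandatory draw from each pool ...
    required <- vapply(pools, function(p) sample(p, 1), character(1))
    # ... then fill the remaining positions from all pools together
    filler <- sample(all_chars, len - length(pools), replace = TRUE)
    # Shuffle so the mandatory characters are not always at the front
    paste0(sample(c(required, filler)), collapse = "")
  }

  data.frame(password = replicate(n, one_password()),
             stringsAsFactors = FALSE)
}

set.seed(1)
pw <- password_strict(len = 8, n = 5)
pw
```

Note that sample() relies on R's general-purpose random number generator, which is not designed for cryptographic use, so this is better suited to test data than to real credentials.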
READING LARGE CSV FILE WITH R

This tutorial explains how to read large CSV files with R. I have tested this code on
files up to 6 GB.
Method I : Using data.table library

library(data.table)
yyy = fread("C:\\Users\\Deepanshu\\Documents\\Testing.csv", header = TRUE)
Method II : Using bigmemory library

library(bigmemory)
y <- read.big.matrix("C:\\Users\\Deepanshu\\Documents\\Testing.csv", type =
"integer", header=TRUE)
dim(y)
#coerce a big.matrix to a matrix

yy= as.matrix(y)
CREATE DUMMY COLUMNS FROM CATEGORICAL VARIABLE

The following code returns new dummy columns from a categorical variable.
DF <- data.frame(strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
for(level in unique(DF$strcol)){
DF[paste("strcol", level, sep = "_")] <- ifelse(DF$strcol == level, 1, 0)}
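As an alternative to the loop, base R's model.matrix can build the same kind of 0/1 columns in one call. This is a sketch under the assumption that one column per level is wanted (no reference level dropped). The names dummies and DF2 are illustrative, and the columns come out in alphabetical level order rather than order of first appearance.

```r
DF <- data.frame(strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))

# "- 1" removes the intercept so every level gets its own indicator column
dummies <- model.matrix(~ strcol - 1, data = DF)

# model.matrix names the columns "strcolA", "strcolB", ...; add the underscore
colnames(dummies) <- sub("^strcol", "strcol_", colnames(dummies))

# Attach the indicator columns to the original data
DF2 <- cbind(DF, as.data.frame(dummies))
head(DF2)
```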
R FUNCTION : CONVERT CATEGORICAL VARIABLES TO CONTINUOUS VARIABLES
In classification models, we often encounter independent variables that have too many
categories or levels. A simple solution is to convert the categorical variable to a
continuous one and use the continuous variable in the model.
The easiest way to convert categorical variables to continuous is by replacing raw
categories with the average response value of the category.
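Adjusted Mean Value for Categorical Predictor
To have a different value against Y=1 and Y=0 for a categorical predictor, we can
adjust the average response value of the category by leaving out the current
observation :
adjusted mean = (category average * category count - y) / (category count - 1)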
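The adjustment can be checked by hand on a few rows. The sketch below uses made-up vectors (y, grp) and computes, for each row, the category mean with that row's own response removed: (category average * category count - y) / (category count - 1). It is a toy illustration of the arithmetic, not a replacement for the function in this section.

```r
# Toy data: category "A" has responses 1, 0, 1; category "B" has 0, 0
y   <- c(1, 0, 1, 0, 0)
grp <- c("A", "A", "A", "B", "B")

avg <- ave(y, grp, FUN = mean)     # category mean, repeated for each row
n   <- ave(y, grp, FUN = length)   # category size, repeated for each row

# Remove the row's own response from its category average
adj <- (avg * n - y) / (n - 1)

data.frame(grp, y, avg, n, adj)
```

Rows with y = 1 and y = 0 inside category "A" receive different values (0.5 and 1), which is the point of the adjustment. A category with a single observation would divide by zero, so such categories need to be pooled first.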
R Function: Converting Categorical Variables to Continuous
# Creating dummy data
set.seed(123)
mydata = data.frame(y= ifelse(sign(rnorm(100))==-1,0,1),
x1= sample(LETTERS[1:5],100,replace = TRUE),
x2= factor(sample(1:7, 100, replace = TRUE)))
# Convert categorical variables to continuous variables

TransformCateg <- function(y,x,inputdata,cutoff){
for (i in seq(1,length(x),1)) {
if (class(inputdata[,x[i]]) %in% c("factor", "character")){
len <- NULL
t1 <- aggregate(inputdata[,y], list(inputdata[,x[i]]), mean)
names(t1)[2] <- "avg"
t2 <- aggregate(inputdata[,y], list(inputdata[,x[i]]), length)
names(t2)[2] <- "len"
temp <- merge(t1, t2, by = "Group.1")
t1 <- subset(temp, len >= cutoff)
t2 <- subset(temp, len < cutoff)
if(nrow(t2) > 0)
{
t2$avg <- sum(t2$avg*t2$len)/sum(t2$len)

t2$len <- sum(t2$len)
}
temp <- rbind(t1, t2)
inputdata <- merge(inputdata, temp, by.x = x[i], by.y = "Group.1", all.x = T)
inputdata[,paste(x[i],"mean", sep="_")] <- ((inputdata$avg * inputdata$len) -
(inputdata[,y]))/(inputdata$len - 1)
inputdata <- inputdata[, !(colnames(inputdata) %in% c("avg","len"))]
}
else{
warning(paste(x[i], " is not a factor or character variable", sep = ""))
}
}
return(inputdata)
}
# Run Function
train2 = TransformCateg(y= "y",x= c("x1","x2"), inputdata = mydata, cutoff = 15)
Parameters of TransformCateg Function
1. y : Response or target or dependent variable - binary (coded 0/1) or continuous
2. x : a character vector of independent variables or predictors - factor or
character variables
3. inputdata : name of input data frame
4. cutoff : minimum observations in a category. All the categories having fewer
observations than the cutoff are combined into a single category.
CARET PACKAGE IMPLEMENTATION IN R

Deepanshu Bhalla
In R, there is a package called caret, which stands for Classification And REgression
Training. It makes predictive modeling easy. It can run most predictive modeling
techniques with cross-validation. It can also perform data slicing and data
pre-processing steps.
Loading required libraries

library(C50)
library(ROCR)
library(caret)
library(plyr)
Set Parallel Processing - Decrease computation time
install.packages("doMC")
library(doMC)
registerDoMC(cores = 5)
Splitting data into training and validation

The following code splits 60% of data into training and remaining into validation.
trainIndex <- createDataPartition(data[,1], p = .6, list = FALSE, times = 1)
dev <- data[ trainIndex,]
val <- data[-trainIndex,]
In this code, a data.frame named "data" contains the full dataset. The list = FALSE
argument avoids returning the data as a list. This function also has an argument,
times, that can create multiple splits at once; the data indices are then returned in
a list of integer vectors.
Similarly, createResample can be used to make simple bootstrap samples
and createFolds can be used to generate balanced cross-validation groupings
from a set of data.
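To make the idea of a class-balanced 60/40 split concrete, here is a base R sketch that samples 60% of the rows within each class. It only illustrates the principle and is not caret's implementation of createDataPartition; the outcome vector y and the name train_idx are made up.

```r
set.seed(42)

# Hypothetical outcome: 70 "good" and 30 "bad" cases
y <- factor(rep(c("good", "bad"), times = c(70, 30)))

# Draw 60% of the row indices separately within each class
by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(by_class, function(idx) {
  sample(idx, size = round(0.6 * length(idx)))
}), use.names = FALSE)

table(y[train_idx])     # class counts in the training part
table(y[-train_idx])    # class counts in the validation part
```

Because sampling happens inside each class, the training and validation parts keep the same class proportions as the full data.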
K Fold Cross-Validation - C5.0

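cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
classProbs = TRUE, summaryFunction = twoClassSummary)
Explanation :
1. repeatedcv : repeated K-fold cross-validation
2. number = 10 : 10-fold cross-validation
3. repeats = 3 : three separate 10-fold cross-validations are run.
4. classProbs = TRUE : It should be TRUE if metric = "ROC" is used in the
train function. It can be skipped if metric = "Kappa" is used.
5. summaryFunction = twoClassSummary : computes ROC, sensitivity and specificity
for two-class problems; train needs it in order to report metric = "ROC".
Note : Kappa measures the agreement between predicted and observed classes, adjusted
for the agreement expected by chance.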
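Kappa can be computed by hand from a confusion table as (observed accuracy - expected accuracy) / (1 - expected accuracy). The sketch below does this in base R on made-up labels; the vectors actual and predicted and the name kappa_hat are assumptions for illustration.

```r
# Hypothetical observed and predicted classes for 10 cases
actual    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes", "no", "no"))
predicted <- factor(c("yes", "no", "yes", "no", "no", "yes", "no", "yes", "no", "no"))

tab <- table(predicted, actual)
n   <- sum(tab)

# Share of cases where prediction and truth agree
observed <- sum(diag(tab)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(tab) * colSums(tab)) / n^2

kappa_hat <- (observed - expected) / (1 - expected)
c(accuracy = observed, kappa = kappa_hat)
```

Here accuracy is 0.8 while Kappa is about 0.58, because roughly half of the agreement could already be obtained by chance.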
There are two ways to tune an algorithm in the Caret R package :
1. tuneLength = It allows the system to tune the algorithm automatically. It
indicates the number of different values to try for each tuning parameter. For
example, mtry for randomForest. Suppose tuneLength = 5; it means try 5 different
mtry values and find the optimal mtry value based on these 5 values.
2. tuneGrid = It means the user has to specify a tune grid manually. In the grid,
each algorithm parameter can be specified as a vector of possible values. These
vectors combine to define all the possible combinations to try.
For example, grid = expand.grid(.mtry= c(1:100))
grid = expand.grid(.interaction.depth = seq(1, 7, by = 2), .n.trees = seq(100, 1000,
by = 50), .shrinkage = c(0.01, 0.1))
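Since expand.grid is a base R function, the size of a manual grid can be inspected before any model is trained. The sketch below rebuilds the second grid above and counts its rows; every row is one parameter combination that train would have to evaluate under each resample.

```r
grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),   # 4 values
                    .n.trees = seq(100, 1000, by = 50),       # 19 values
                    .shrinkage = c(0.01, 0.1))                # 2 values

nrow(grid)   # 4 * 19 * 2 = 152 combinations
head(grid)
```

With 10-fold cross-validation repeated 3 times, such a grid implies 152 * 30 resampled evaluations, so it is worth checking nrow(grid) before starting a long run.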
Example 1 : train with tuneGrid (Manual Grid)

grid <- expand.grid(.model = "tree", .trials = c(1:100), .winnow = FALSE)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
tuneGrid = grid, trControl = cvCtrl)
Example 2 : train with tuneLength (Automatic Grid)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
tuneLength = 10, trControl = cvCtrl)
Finding the Tuning Parameter for each of the algorithms
Visit this link - http://topepo.github.io/caret/modelList.html
Calculating the Variable Importance

varImp(tuned$finalModel , scale=FALSE)
plot(varImp(tuned$finalModel))
To get the area under the ROC curve for each predictor, the filterVarImp function
can be used. The area under the ROC curve is computed for each class.
RocImp <- filterVarImp(x = dev[, -1], y = dev[,1])
RocImp
# Seeing result
tuned
# Seeing Parameter Tuning
trellis.par.set(caretTheme())
plot(tuned, metric = "ROC")
# Seeing final model result
print(tuned$finalModel)
#Summaries of C5.0 Model
summary(tuned$finalModel)
# variable Importance
C5imp(tuned$finalModel, metric="usage")
#Scoring
val1 = predict(tuned$finalModel, val[, -1], type = "prob")
Other Useful Functions
1. nearZeroVar : a function to identify predictors that are sparse and highly
unbalanced, so that they can be removed
2. findCorrelation : a function to identify the set of predictors to remove in order
to achieve low pairwise correlations
3. preProcess : pre-processing of predictors, such as centering, scaling and PCA
4. predictors : generic function for determining which predictors are included in the
prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)
5. confusionMatrix, sensitivity, specificity, posPredValue, negPredValue : functions
for assessing classifier performance
CARET PACKAGE IMPLEMENTATION IN R [PART II]

This article explains useful functions of the caret package in R. If you are new
to the caret package, check out the Part I tutorial.
Method Functions in trainControl Parameter
1. none - No cross validation or Bootstrapping
2. boot - Bootstrapping
3. cv - Cross validation
4. repeatedcv - Repeated Cross Validation
5. oob - Out of Bag (only for random forest, bagged trees, bagged earth,
bagged flexible discriminant analysis, or conditional tree forest models)
Example
ctrl <- trainControl(method = "cv", classProbs = TRUE, summaryFunction =
twoClassSummary, number = 5)
Selecting the Least Complex Model

Step I : Train your model
set.seed(825)
gbmFit3 <- train(Class ~ ., data = training, method = "gbm", trControl =
fitControl, verbose = FALSE, tuneGrid = gbmGrid, metric = "ROC")
Step II : The tolerance function selects the least complex model within some percent
tolerance of the best value. In the code below, tol = 2 means accepting a loss of up
to 2% in AUC relative to the best model.
whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC", tol = 2, maximize =
TRUE)
cat("best model within 2 pct of best:\n")
gbmFit3$results[whichTwoPct,1:6]
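The selection rule can be illustrated in base R on a small, made-up results table. This is a sketch of the idea only, not caret's source code: it assumes the rows are ordered from the simplest to the most complex model and that a higher ROC is better. The names results, pct_loss and pick are hypothetical.

```r
# Hypothetical resampling results, ordered from simplest to most complex
results <- data.frame(n.trees = c(50, 100, 150, 200),
                      ROC     = c(0.900, 0.915, 0.928, 0.930))

tol  <- 2                                  # accept up to 2% loss
best <- max(results$ROC)

# Percent loss of each model relative to the best ROC
pct_loss <- (best - results$ROC) / best * 100

# First (simplest) row whose loss stays within the tolerance
pick <- min(which(pct_loss < tol))
results[pick, ]
```

In this toy table the model with 100 trees is chosen: it loses about 1.6% of the best ROC, while the 50-tree model loses about 3.2% and is rejected.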
CREATE WORDCLOUD WITH R

A wordcloud is a text mining technique that allows us to visualize the most frequently
used keywords in a paragraph.
[Image: Create WordCloud with R Programming]
How to create Word Cloud with R
Step 1 : Install the required packages
install.packages("wordcloud")
install.packages("tm")
install.packages("ggplot2")
Note : If these packages are already installed, you don't need to install them again.
Step 2 : Load the above installed packages
library("wordcloud")
library("tm")
library(ggplot2)
Step 3 : Import data into R
Import a single file

cname<-read.csv("C:/Users/Deepanshu Bhalla/Documents/Text.csv",header=TRUE)
Import multiple files from a folder
setwd("C:\\Users\\Deepanshu Bhalla\\Documents\\text mining")
cname <-getwd()
## Number of documents
length(dir(cname))
## list file names
dir(cname)
Note : In the above syntax, "text mining" is a folder name. I have placed all text
files in this folder
Step 4 : Locate and load the corpus
If you imported a single file
docs<-Corpus(VectorSource(cname[,1]))
If you imported multiple files from a folder
docs <- Corpus(DirSource(cname))
docs
summary(docs)
inspect(docs[1])
Step 5 : Data Cleaning

# Simple Transformation
for (j in seq(docs))
{
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
docs[[j]] = gsub("@\\w+", "", docs[[j]])
docs[[j]] = gsub("http\\w+", "", docs[[j]])
docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])
docs[[j]] = gsub("^\\s+|\\s+$", "", docs[[j]])
docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
}
# Specify stopwords other than the built-in English stopwords
skipwords = c(stopwords("english"), "system","technology")
kb.tf <- list(weighting = weightTf, stopwords = skipwords,
removePunctuation = TRUE,
tolower = TRUE,
minWordLength = 4,
removeNumbers = TRUE, stripWhitespace = TRUE,
stemDocument= TRUE)
# term-document matrix
docs <- tm_map(docs, PlainTextDocument)
tdm = TermDocumentMatrix(docs, control = kb.tf)
# convert as matrix
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
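The last steps (count words, sort in decreasing order, build a data frame of words and frequencies) can be mimicked in base R to see what dm looks like. This is a stand-in for illustration and does not use tm; the sample sentences and the simple split on non-letters are assumptions.

```r
# Hypothetical mini corpus
text <- c("R makes text mining easy",
          "Text mining with R is fun",
          "R is fun")

# Lowercase, split on anything that is not a letter, drop empty tokens
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Word counts in decreasing order
word_freqs <- sort(table(words), decreasing = TRUE)

# Data frame with words and their frequencies
dm <- data.frame(word = names(word_freqs),
                 freq = as.vector(word_freqs),
                 stringsAsFactors = FALSE)
head(dm)
```

The resulting dm has the same two columns (word, freq) that the plotting code in the next step expects.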
Step 6 : Create WordCloud with R

# Keep wordcloud the same
set.seed(123)
#Plot Histogram
p <- ggplot(subset(dm, freq>10), aes(word, freq))
p <-p+ geom_bar(stat="identity")
p <-p+ theme(axis.text.x=element_text(angle=45, hjust=1))
p
#Plot Wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(6,
"Dark2"),min.freq=10, scale=c(4,.2),rot.per=.15,max.words=100)
Note : You can remove sparse terms with the following code. Apply it to the
TermDocumentMatrix object, before it is converted with as.matrix :
tdm.frequent = removeSparseTerms(tdm, 0.1)
