Getting Started With R
Download
1. To download R, visit the official R site: http://www.r-project.org/.
2. Click the "CRAN" link on the left-hand side of the page.
3. Choose your country and click the link available for your location.
4. Click Download R for Windows (for Windows).
5. Click base.
Basics of R Programming
1. Write your first equation in R
Enter your equation in the command window after the ">" prompt (without quotes)
and hit ENTER to see the result. The [1] tells you the resulting value is the
first result.
2. The # character at the beginning of a line signifies a comment.
3. The operator "<-" (without quotes) is equivalent to the "=" sign for assignment.
You can use either operator.
R syntax editor
Go to File >> click New Script >> in the new R Editor window, write your
code, highlight it and press F5 to run it.
4. The getwd() function shows the working directory
8. Press CTRL + ENTER to enter a continued line. The "+" prompt shows
line continuation.
10. To calculate sum excluding NA, use na.rm = TRUE (By default, it is FALSE).
12. R is case-sensitive, so you have to use the exact case that the program
requires.
13. To get help for a certain function such as sum, use the form: help(sum)
14. Object names in R can be any length consisting of letters, numbers,
underscores _ or the period .
15. Object names in R should begin with a letter.
16. Unlike SAS and SPSS, R has several different data structures including vectors,
factors, data frames, matrices, arrays, and lists. The data frame is most like a dataset in
SAS.
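The basics above can be tried in one short session. This is a minimal sketch; the object names (total, Total, scores) are made up for illustration.

```r
# Assignment: "<-" and "=" both work at the top level
total <- 2 + 3
Total = 10                  # R is case-sensitive: Total and total are different objects

# sum() returns NA when a value is missing, unless na.rm = TRUE
scores <- c(4, NA, 6)
sum(scores)                 # NA
sum(scores, na.rm = TRUE)   # 10

# Show the current working directory
getwd()
```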
Data Types
1. Vectors
A vector is an object that contains a set of values called its elements.
Numeric vector
x <- c(1,2,3,4,5,6)
The operator <- is equivalent to the "=" sign.
Character vector
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
To calculate frequency for State vector, you can use table function.
Since the above vector contains an NA (not available) value, the mean
function returns NA.
To calculate the mean of a vector excluding NA values, you can
include the na.rm = TRUE argument in the mean function.
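A short sketch of both points follows: a frequency table for the State vector, and the effect of NA on mean(). The numeric vector y is a hypothetical stand-in, since the original numeric example is not shown here.

```r
# Frequency table for a character vector
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
freq <- table(State)
freq

# Hypothetical numeric vector with one missing value
y <- c(1, 2, NA, 4)
mean(y)                 # NA, because of the missing value
mean(y, na.rm = TRUE)   # mean of 1, 2 and 4 only
```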
2. Factors
R has a special data structure to store categorical variables. It tells R that a
variable is nominal or ordinal by making it a factor.
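As an illustration, a character vector can be turned into a factor like this. The vector gender is a made-up example, not data from the tutorial.

```r
# Hypothetical nominal variable
gender <- factor(c("male", "female", "female", "male"))

levels(gender)       # "female" "male" (levels are sorted alphabetically by default)
as.integer(gender)   # internal integer codes: 2 1 1 2
table(gender)        # counts per category
```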
3. Matrices
All values in columns in a matrix must have the same mode (numeric, character, etc.)
and the same length.
The cbind function joins columns together into a matrix. See the usage below
The numbers to the left side in brackets are the row numbers. The form [1, ] means that
it is row number one and the blank following the comma means that R has displayed all
the columns.
To see dimension of the matrix, you can use dim function.
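A minimal sketch of cbind() and dim(), using two made-up vectors a and b:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

m <- cbind(a, b)   # join the two vectors as columns
m
#      a b
# [1,] 1 4
# [2,] 2 5
# [3,] 3 6

dim(m)             # 3 2 (rows, columns)
m[1, ]             # row one, all columns
```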
4. Arrays
Arrays are similar to matrices but can have more than two dimensions.
5. Data Frames
A data frame is similar to SAS and SPSS datasets. It contains variables and records.
It is more general than a matrix, in that different columns can have different modes
(numeric, character, factor, etc.).
The data.frame function is used to combine variables (vectors and factors)
into a data frame.
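For example, three variables of different modes can be combined as sketched below. The variable names are hypothetical.

```r
# Hypothetical variables of different modes
id    <- 1:3
name  <- c("A", "B", "C")
grade <- factor(c("low", "high", "low"))

df <- data.frame(id, name, grade)
str(df)             # one line per column with its type
sapply(df, class)   # class of each column
```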
6. Lists
A list allows you to store a variety of objects.
You can use subscripts to select the specific component of the list.
[1] "factor"
> mode(x)
[1] "numeric"
set.seed(1)
df4 <- data.frame(Y = rnorm(15), Z = ceiling(rnorm(15)))
This tutorial explains how to get external data into R. It describes how to load data
from various sources such as CSV, text, Excel, SAS or SPSS.
Importing Data in R
Loading data into the tool is one of the initial steps of any project. If you have just
started using R, you will soon need to read in data from other sources.
1. Reading a comma-delimited text file (CSV)
If you don't have the names of the variables in the first row
mydata <- read.csv("c:/mydata.csv", header=FALSE)
EXPORTING DATA IN R
Deepanshu Bhalla Add Comment R Tutorial
This tutorial explains how we can create data in MS Excel and paste it into the R syntax
editor window to create a table in R. MS Excel is one of the most frequently used
tools in the analytics industry. Many companies have switched their core analytics work
from Excel to R / SAS, but we still find some of the data in Excel files.
If you prefer to import an Excel file into R rather than copying and pasting Excel data,
you can check out this tutorial: Importing Data to R
data = read.table(text="
X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3", header=TRUE)
It creates 3 columns - X, Y and Z. The header = TRUE tells R to consider
first row as header.
This article demonstrates how to explore data with R. It is very important to explore
data before starting to build a predictive model. It gives an idea about the structure
of the dataset like number of continuous or categorical variables and number of
observations (rows).
Dataset
The dataset used in this tutorial has five variables: Q1, Q2, Q3, Q4 and Age.
The variables Q1-Q4 represent survey responses to a questionnaire. Each response
lies between 1 and 6. The variable Age represents the age group of the respondents.
It lies between 1 and 3: 1 represents Generation Z, 2 represents Generation X and Y,
and 3 represents Baby Boomers.
Sample Data
To calculate summary of a particular column, say third column, you can use the
following syntax :
summary( mydata[3])
To calculate summary of a particular column by its name, you can use the following
syntax :
summary( mydata$Q1)
This tutorial covers how to execute most frequently used data manipulation tasks
with R. It includes various examples with datasets and code. This tutorial is
designed for beginners who are very new to R programming language. It gives you
a quick look at several functions used in R.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
In this example, we are replacing 2 and 3 with NA values in whole dataset.
mydata[mydata == 2 | mydata == 3] <- NA
Another method
You have to first install the car package.
# Install the car package
install.packages("car")
# Load the car package
library("car")
# Recode 1 to 6
mydata$Q1 <- recode(mydata$Q1, "1=6")
3. Renaming variables
To rename variables, you have to first install the dplyr package.
# Install the dplyr package
install.packages("dplyr")
newdata<-subset(mydata, !is.na(age))
6. Sorting
Sorting is one of the most common data manipulation tasks. It is generally used
when we want to see the top 5 highest / lowest values of a variable.
Sorting a vector
x = sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It cannot be used
on objects with more than one dimension, such as a data frame.
7. Value labeling
Use factor() for nominal data
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male",
"female"))
Use ordered() for ordinal data
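A minimal sketch of ordered() follows, assuming a hypothetical rating variable coded 1 = low, 2 = medium, 3 = high.

```r
# Hypothetical survey column coded 1 = low, 2 = medium, 3 = high
rating <- c(2, 1, 3, 2)

rating_f <- ordered(rating, levels = c(1, 2, 3),
                    labels = c("low", "medium", "high"))
rating_f
max(rating_f)   # "high"; max() works because the levels have an order
```

Unlike a plain factor, an ordered factor supports comparisons such as rating_f >= "medium".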
9. Aggregate by groups
The following code calculates mean for variable "x" by grouped variable "y".
samples = data.frame(x =c(rep(1:10)), y=round((rnorm(10))))
mydata <- aggregate(x~y, samples, mean, na.rm = TRUE)
To calculate frequency for State vector, you can use table() function.
The function rbind() does not work when the column names do not match in the two
datasets. For example, dataframe1 has 3 columns A, B and C. dataframe2 also has 3
columns A, D and E. The function rbind() throws an error. The function smartbind() from
gtools combines column A and returns NAs where the column names do not
match.
install.packages("gtools") #If not installed
library(gtools)
mydata <- smartbind(mydata1, mydata2)
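The rbind() failure described above can be reproduced in base R without installing anything. In this sketch the two data frames are made up to match the A/B/C and A/D/E example, and tryCatch() is used only so the script keeps running after the error.

```r
# Hypothetical data frames with mismatched column names
dataframe1 <- data.frame(A = 1:2, B = 3:4, C = 5:6)
dataframe2 <- data.frame(A = 7:8, D = 9:10, E = 11:12)

# Capture the error message instead of stopping the script
result <- tryCatch(rbind(dataframe1, dataframe2),
                   error = function(e) conditionMessage(e))
result   # an error message about the names not matching
```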
What is dplyr?
dplyr Tutorial
dplyr Function    Equivalent SQL
select()          SELECT
filter()          WHERE
group_by()        GROUP BY
summarise()       -
arrange()         ORDER BY
join()            JOIN
mutate()          COLUMN ALIAS
Input Dataset
Submit the following code. Change the file path in the code below.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")
select( ) Function
It is used to select only desired variables.
select() syntax : select(data , ....)
data : Data Frame
.... : Variables by name or by function
Helpers           Description
starts_with()
ends_with()
contains()
matches()
num_range()
one_of()
everything()      All variables.
rename( ) Function
It is used to change variable name.
rename() syntax : rename(data , new_name = old_name)
data : Data Frame
new_name : New variable name you want to keep
old_name : Existing Variable Name
filter( ) Function
It is used to subset data with matching logical conditions.
filter() syntax : filter(data , ....)
data : Data Frame
.... : Logical Condition
summarise( ) Function
It is used to summarize data.
summarise() syntax : summarise(data , ....)
data : Data Frame
..... : Summary Functions such as mean, median etc
arrange() function :
Use : Sort data
Syntax
arrange(data_frame, variable(s)_to_sort)
or
data_frame %>% arrange(variable(s)_to_sort)
To sort a variable in descending order, use desc(x).
Syntax :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to filter function. It can be used with any
function.
Example :
The code below demonstrates the usage of pipe %>% operator. In this example, we
are selecting 10 random observations of two variables "Index" "State" from the data
frame "mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)
group_by() function :
Use : Group data by categorical variable
Syntax :
group_by(data, variables)
or
data %>% group_by(variables)
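The verbs introduced so far can be chained with the pipe. The sketch below assumes the dplyr package is installed and uses a small hypothetical data frame (sales) rather than the tutorial's sampledata.csv.

```r
library(dplyr)

# Hypothetical data frame for illustration
sales <- data.frame(State  = c("DL", "MU", "DL", "NY", "MU", "NY"),
                    Amount = c(10, 20, 30, 40, 50, 60))

by_state <- sales %>%
  filter(Amount > 10) %>%              # keep rows matching a condition
  group_by(State) %>%                  # one group per state
  summarise(Total = sum(Amount)) %>%   # one summary row per group
  arrange(desc(Total))                 # largest total first

by_state
```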
do() function :
mutate() function :
mutate(data_frame, expression(s) )
or
data_frame %>% mutate(expression(s))
By default, min_rank() assigns 1 to the smallest value and higher ranks to
larger values. If you need to assign rank 1 to the largest value of a variable,
use min_rank(desc(.))
mydata13 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(desc(.))))
join() function :
Use : Join two datasets
Syntax :
inner_join(x, y, by = )
left_join(x, y, by = )
right_join(x, y, by = )
full_join(x, y, by = )
semi_join(x, y, by = )
anti_join(x, y, by = )
x, y - datasets (or tables) to merge / join
by - common variable (primary key) to join by.
y=rnorm(5),
z=letters[1:5])
df2 <- data.frame(ID = c(1, 7, 3, 6, 8),
a = c('z', 'b', 'k', 'd', 'l'),
b = c(1, 2, 3, 0, 4),
c =rnorm(5),
d =letters[2:6])
INNER JOIN returns rows when there is a match in both tables. In this example, we
are merging df1 and df2 with ID as common variable (primary key).
df3 = inner_join(df1, df2, by = "ID")
If the primary key does not have same name in both the tables, try the following
way:
inner_join(df1, df2, by = c("ID"="ID1"))
union(x, y)
Rows that appear in either or both x and y.
setdiff(x, y)
Rows that appear in x but not y.
Suppose you are asked to combine two data frames. Let's first create two sample
datasets.
df1=data.frame(ID = 1:6, x=letters[1:6])
df2=data.frame(ID = 7:12, x=letters[7:12])
Input Datasets
The bind_rows() function combines two datasets by rows. So the combined dataset
would contain 12 rows (6+6) and 2 columns.
xy = bind_rows(df1,df2)
It is equivalent to base R function rbind.
xy = rbind(df1,df2)
The bind_cols() function combines two datasets by columns. So the combined
dataset would contain 4 columns and 6 rows.
xy = bind_cols(df1,df2)
or
xy = cbind(df1,df2)
Endnotes
There are hundreds of packages that depend on this package. Its main
benefits are that it makes R programming less intimidating, makes coding easier and
lowers processing time. However, some R programmers prefer the data.table package
for its speed. I would recommend learning both packages. Check out the data.table
tutorial. The data.table package beats dplyr on speed when the data size is
greater than 1 GB.
data.table Tutorial
data.table Syntax
The syntax of data.table is as follows:
DT[ i , j , by]
1. The first parameter of data.table, i, refers to rows. It implies subsetting
rows. It is equivalent to the WHERE clause in SQL.
2. The second parameter of data.table, j, refers to columns. It implies
subsetting columns (dropping / keeping). It is equivalent to the SELECT clause in
SQL.
3. The third parameter of data.table, by, refers to adding a group so that
all calculations are done within a group. It is equivalent to SQL's GROUP
BY clause.
The data.table syntax is NOT RESTRICTED to only 3
parameters. There are other arguments that can be added to data.table
syntax. The list is as follows:
1. with, which
2. allow.cartesian
3. roll, rollends
4. .SD, .SDcols
5. on, mult, nomatch
The above arguments are explained in the latter part of the post.
Read Data
In data.table package, fread() function is available to read or get data from
your computer or from a web page. It is equivalent to read.csv() function of
base R.
mydata = fread("https://github.com/arunsrinivasan/satrdaysworkshop/raw/master/flights_2014.csv")
Describe Data
This dataset contains 253K observations and 17 columns. It constitutes
information about flights' arrival or departure time, delays, flight cancellation
Dropping a Column
Suppose you want to include all the variables except one column, say,
'origin'. It can be easily done by adding the ! sign (which implies negation in R).
dat5 = mydata[, !c("origin"), with=FALSE]
Rename Variables
You can rename variables with setnames() function. In the following code,
we are renaming a variable 'dest' to 'destination'.
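The renaming code itself is not shown here. A minimal sketch, assuming the data.table package is installed and using a small hypothetical table (dt) in place of the flights data, could look like this:

```r
library(data.table)

# Hypothetical small table standing in for the flights data
dt <- data.table(origin = c("JFK", "LGA"), dest = c("LAX", "ORD"))

# setnames(x, old, new) renames in place; no assignment is needed
setnames(dt, "dest", "destination")
names(dt)   # "origin" "destination"
```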
Set Key
In this case, we are setting 'origin' as a key in the dataset mydata.
# Indexing (Set Keys)
setkey(mydata, origin)
Note : It makes the data table sorted by the column 'origin'.
Performance Comparison
You can compare performance of the filtering process (With or Without
KEY).
system.time(mydata[origin %in% c("JFK", "LGA")])
system.time(mydata[c("JFK", "LGA")])
If you compare the elapsed (real) times, setting a key makes filtering about
twice as fast as filtering without keys.
Sorting Data
We can sort data using the setorder() function. By default, it sorts data in
ascending order.
mydata01 = setorder(mydata, origin)
IF THEN ELSE
The 'IF THEN ELSE' conditions are very popular for recoding values. In
data.table package, it can be done with the following methods :
Method I : mydata[, flag:= 1*(min < 50)]
Method II : mydata[, flag:= ifelse(min < 50, 1,0)]
It means to set flag= 1 if min is less than 50. Otherwise, set flag =0.
Summary by group
Remove Duplicates
You can remove non-unique / duplicate cases with the unique() function. Suppose
you want to eliminate duplicates based on a variable, say, carrier.
setkey(mydata, "carrier")
unique(mydata, by = key(mydata))
Note : Recent versions of data.table compare all columns by default, so the key
columns have to be passed through the by argument.
Suppose you want to remove duplicates based on all the variables. You can
use the command below.
setkey(mydata, NULL)
unique(mydata)
Note : Setting key to NULL is not required if no key is already set.
Merging / Joins
Sample Data
(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
Inner Join
Left Join
It returns all observations from the left dataset and the matched
observations from the right dataset.
merge(dt1, dt2, by="A", all.x = TRUE)
Right Join
It returns all observations from the right dataset and the matched
observations from the left dataset.
merge(dt1, dt2, by="A", all.y = TRUE)
Full Join
It returns all rows when there is a match in one of the datasets.
merge(dt1, dt2, all=TRUE)
Convert a data.table to data.frame
You can use setDF() function to accomplish this task.
setDF(mydata)
Similarly, you can use setDT() function to convert data frame to data table.
set.seed(123)
X = data.frame(A=sample(3, 10, TRUE),
B=sample(letters[1:3], 10, TRUE))
setDT(X, key = "A")
Rolling Joins
It supports rolling joins. They are commonly used for analyzing time series
data. Very few R packages support this kind of join.
Endnotes
This package provides a one-stop solution for data wrangling in R. It offers
two main benefits: less coding and lower computing time. However, it is not
the first choice of some R programmers. Some prefer the dplyr package for its
simplicity. I would recommend learning both packages. Check out the dplyr
tutorial. If you are working on data smaller than 1 GB, you can use the
dplyr package. It offers decent speed but is slower than the data.table package.
TRANSPOSE DATA IN R
In R, we can transpose our data very easily. There are many packages,
such as tidyr and reshape2, that help to make it easy. In this article, I
use the 'reshape2' package. This package was written by Hadley Wickham, one of
the most popular R experts.
Sample Data
The code below would create a sample data that would be used for
demonstration.
data <- read.table(text="X Y Z
ID12 2012-06 566
ID1 2012-06 10239
ID6 2012-06 524
ID12 2012-07 2360
ID1 2012-07 13853
ID6 2012-07 2352
ID12 2012-08 3950
ID1 2012-08 14738
ID6 2012-08 4104",header=TRUE)
Suppose you have data containing three variables: X, Y and Z. The
variable 'X' contains IDs, the variable 'Y' contains dates and the variable
'Z' contains income. The data is structured in long format and you need to
convert it to wide format so that the dates move to the column names and the
income values appear under these dates.
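For the X / Y / Z sample data, a long-to-wide conversion with dcast() might look like the sketch below. It assumes the reshape2 package is installed. The formula X ~ Y is my reading of the description (IDs stay as rows, dates become columns), not code taken from the tutorial.

```r
library(reshape2)   # assumed to be installed

data <- read.table(text = "X Y Z
ID12 2012-06 566
ID1 2012-06 10239
ID6 2012-06 524
ID12 2012-07 2360
ID1 2012-07 13853
ID6 2012-07 2352
ID12 2012-08 3950
ID1 2012-08 14738
ID6 2012-08 4104", header = TRUE)

# One row per ID (X), one column per date (Y), filled with income (Z)
wide <- dcast(data, X ~ Y, value.var = "Z")
wide
```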
library(reshape2)
xx=dcast(data, Year + SemiYear ~ Product, value.var = "Income")
In the above code, "Year + SemiYear" are the 2 ID variables. We want
"Product" variable to be moved to columns.
Output
ProductA
ProductB
ProductA
ProductB
27446
23176
22324
24881
3. variable.name - name of variable used to store measured variable names
4. value.name - name of variable used to store values
LOOPS IN R
Loops with R
What is Loop?
Loops help you repeat a similar operation on different variables, columns or
datasets. For example, suppose you want to multiply each variable by 5. Instead
of multiplying each variable one by one, you can perform this task in a loop.
The main benefit is less duplication in your code, which makes later changes
easier.
1. Apply Function
2. Lapply Function
Use lapply when you want to apply a function to each element of a data structure
and get a list back.
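A small illustration follows, using a made-up list of numeric vectors.

```r
# Hypothetical list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

means <- lapply(scores, mean)   # apply mean() to each element
class(means)                    # "list"
means$b                         # 15
```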
3. Sapply Function
sapply is a user-friendly version of lapply: it returns a vector when we
apply a function to each element of a data structure.
Explanation :
1. index returns TRUE / FALSE depending on whether each variable is a factor or not.
2. Only those variables for which index = TRUE are converted.
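The code this explanation refers to is not shown here. The following is a sketch of the same idea under my own assumptions: a hypothetical data frame dat, and a conversion from factor to character (the conversion target is a guess).

```r
# Hypothetical data frame with one factor column
dat <- data.frame(id = 1:3,
                  grp = factor(c("a", "b", "a")),
                  stringsAsFactors = FALSE)

index <- sapply(dat, is.factor)   # named logical vector, one entry per column
index

# Convert only the columns where index is TRUE
dat[index] <- lapply(dat[index], as.character)
sapply(dat, class)
```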
4. For Loop
Like the apply family of functions, a for loop is used to repeat the same task on
multiple data elements or datasets. It is similar to the for loop in other
languages such as VB, Python etc. This concept is not new and has been in
the programming field for many years.
do.call() applies a given function to the list as a whole. When it is used with
rbind, it would bind all the list arguments.
temp =list()
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
temp[[i]] =series
}
output = do.call(rbind, temp)
output
5. While Loop in R
A while loop is broader than a for loop because you can rewrite any for
loop as a while loop, but not vice-versa.
In the example below, we check whether a number is odd or even.
i=1
while(i<7)
{
if(i%%2==0)
print(paste(i, "is an Even number"))
else if(i%%2>0)
print(paste(i, "is an Odd number"))
i=i+1
}
The double percent sign (%%) indicates mod. Read i%%2 as mod(i,2). The
iteration runs from 1 to 6 (i.e. while i<7). It stops when the condition is no longer met.
Output:
[1] "1 is an Odd number"
[1] "2 is an Even number"
[1] "3 is an Odd number"
[1] "4 is an Even number"
[1] "5 is an Odd number"
[1] "6 is an Even number"
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
Next Keyword
When a loop encounters 'next', it terminates the current iteration and moves
to next iteration.
for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
next
} else {
print(paste("i=", i, "j=", j))
}
}
}
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
[1] "i= 2 j= 2"
[1] "i= 2 j= 1"
[1] "i= 3 j= 1"
If you get confused between 'break' and 'next', compare the output of both
and see the difference.
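For comparison, here is the same nested loop with 'break' in place of 'next'. This is a sketch written to mirror the loop above; break leaves the inner loop entirely instead of skipping one iteration.

```r
for (i in 1:3) {
  for (j in 3:1) {
    if ((i + j) > 4) {
      break            # leave the inner loop entirely
    } else {
      print(paste("i=", i, "j=", j))
    }
  }
}
# [1] "i= 1 j= 3"
# [1] "i= 1 j= 2"
# [1] "i= 1 j= 1"
```

With break, the inner loop ends at j = 3 for i = 2 and i = 3, so only the i = 1 lines are printed. With next, the later j values are still visited.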
ERROR HANDLING IN R
R : CHARACTER FUNCTIONS
This tutorial lists some of the most useful character functions in R. It includes
concatenating two strings, extracting a portion of text from a string, extracting a word
from a string, making text uppercase or lowercase, replacing text with other text, etc.
Character Functions in R
Basics
In R, strings are stored in a character vector. You can create strings with single
or double quotes.
Like the is.character function, there are other functions such as is.numeric, is.integer
and is.array for checking for numeric vectors, integers and arrays.
3. Concatenate Strings
The paste function is used to join two strings. It is one of the most important
string manipulation tasks. Every analyst performs it almost daily to structure data.
Example 1
x = "Deepanshu"
y ="Bhalla"
paste(x, y)
Output : Deepanshu Bhalla
paste(x, y, sep = ",")
Output : Deepanshu,Bhalla
Output : "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10"
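One call that produces the "x1" ... "x10" output above is sketched here. The original call is not shown, so this is my reconstruction.

```r
# Vectorised paste: the scalar "x" is recycled against 1:10
labels <- paste("x", 1:10, sep = "")
labels

# collapse joins the elements of one vector into a single string
paste(c("a", "b", "c"), collapse = "-")   # "a-b-c"
```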
4. String Formatting
Suppose the value is stored in fraction and you need to convert it to percent.
The sprintf is used to perform C-style string formatting.
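A minimal sketch of the fraction-to-percent case, using a made-up value x:

```r
x <- 0.25
pct <- sprintf("%.0f%%", 100 * x)   # "%%" prints a literal percent sign
pct                                 # "25%"

sprintf("%.1f%%", 100 * 0.125)      # "12.5%", one decimal place
```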
Other Examples
a = seq(1, 5)
sprintf("x%03d", a)
Output : "x001" "x002" "x003" "x004" "x005"
The letter 'd' in the format is used for an integer value.
sprintf("%s has %d rupees", "Ram", 500)
Output : "Ram has 500 rupees"
The letter 's' in the format is used for character string.
6. String Length
If ignore.case is FALSE, the pattern matching is case sensitive; if TRUE, case is
ignored during matching.
sub("okay", "fine", "She is okay.")
Output : She is fine.
In the above example, we are replacing the word 'okay' with 'fine'.
Suppose you need to pull a first or last word from a character string.
Example
x = "I love R Programming"
library(stringr)
word(x, 1,sep = " ")
Output : I
In the example above, '1' denotes the first word to be extracted from the string. sep=" "
denotes a single space as the delimiter (it's the default delimiter in the word function).
9. Convert Character to Uppercase / Lowercase / Propercase
Many times, we need to change the case of a word, for example, to convert it to
uppercase or lowercase.
Examples
x = "I love R Programming"
tolower(x)
Output : "i love r programming"
The tolower() function converts letters in a string to lowercase.
toupper(x)
Output : "I LOVE R PROGRAMMING"
The toupper() function converts letters in a string to uppercase.
library(stringr)
str_to_title(x)
Output : "I Love R Programming"
The str_to_title() function converts the first letter of each word in a string to
uppercase and the remaining letters to lowercase.
Syntax :
trimws(x, which = c("both", "left", "right"))
Default Option : both : It implies removing both leading and trailing whitespace.
If you want to remove only leading spaces, you can specify "left". For removing
trailing spaces,specify "right".
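The three options can be compared on a made-up string s:

```r
s <- "   R Tutorial   "

trimws(s)                    # "R Tutorial"
trimws(s, which = "left")    # "R Tutorial   "
trimws(s, which = "right")   # "   R Tutorial"
```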
bhalla"
Output : "xxx"
Output : "I" "love" "R" "Programming"
x[!grepl("(?i)^d",x)]
Output : "Sandy" "Jades"
"Jades"
Sample Data
data = read.table(text="
X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header=TRUE)
Apply Function
Use apply when you want to apply a function to the rows or columns of a matrix or
data frame. It cannot be applied to lists or vectors.
It returns NA if NAs exist in a row. To ignore NAs, you can use the following line of
code.
apply(data, 1, max, na.rm = TRUE)
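The sketch below runs apply() over rows and over columns. It rebuilds the sample data with one value per column, which is my reading of the three columns X, Y and Z.

```r
data <- read.table(text = "
X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header = TRUE)

row_max       <- apply(data, 1, max)                  # NA where a row has a missing value
row_max_clean <- apply(data, 1, max, na.rm = TRUE)    # missing values ignored
col_mean      <- apply(data, 2, mean, na.rm = TRUE)   # MARGIN = 2 works on columns

row_max
row_max_clean
col_mean
```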
The article below explains how to keep or remove variables (columns) from data
frame. In R, there are multiple ways to select or drop variables.
Sample Data
variables. Make sure the variable names would NOT be specified in quotes when
using subset() function.
df = subset(mydata, select = -c(x,z) )
Method II :
In this method, we are creating a character vector named drop in which we are
storing column names x and z. Later we are telling R to select all the variables
except the column names specified in the vector drop. The function names() returns
all the column names and the '!' sign indicates negation.
drop <- c("x","z")
df = mydata[,!(names(mydata) %in% drop)]
It can also be written like : df = mydata[,!(names(mydata) %in% c("x","z"))]
Method II :
We can keep variables with subset() function.
df = subset(mydata, select = c(x,z) )
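Both approaches can be checked on a small hypothetical data frame. This sketch adds drop = FALSE, which is my addition: without it, a selection that leaves a single column is returned as a plain vector rather than a data frame.

```r
# Hypothetical data frame with columns x, y and z
mydata <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# Drop x and z by name
dropped <- mydata[, !(names(mydata) %in% c("x", "z")), drop = FALSE]

# Keep x and z with subset()
kept <- subset(mydata, select = c(x, z))

names(dropped)   # "y"
names(kept)      # "x" "z"
```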
This article describes how to keep or drop columns by their name pattern. With
regular expressions, we can easily keep or drop columns whose names contain a
special keyword or a pattern. It is very useful when we have a large number of variables
and we need to select only those columns that share a pattern.
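A minimal base R sketch of pattern-based selection follows, using grepl() on the column names. The data frame and the "INC" pattern are hypothetical.

```r
# Hypothetical data frame for illustration
mydata <- data.frame(INC_A = 1:2, INC_B = 3:4, AGE = 5:6, ID_INC = 7:8)

# Keep columns whose names start with "INC"
keep_cols <- mydata[, grepl("^INC", names(mydata)), drop = FALSE]

# Drop every column whose name contains "INC" anywhere
drop_cols <- mydata[, !grepl("INC", names(mydata)), drop = FALSE]

names(keep_cols)   # "INC_A" "INC_B"
names(drop_cols)   # "AGE"
```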
Table II : DF2
df2 <- data.frame(ID = c(1, 7, 3, 6, 8),
a = c('z', 'b', 'k', 'd', 'l'),
b = c(1, 2, 3, 0, 4),
c =rnorm(5),
d =letters[2:6])
Inner Join
df3 = merge(df1, df2, by ="ID")
If the primary key (matching variable) does not have the same name in both tables
(data frames), use by.x and by.y:
df3 = merge(df1, df2, by.x ="ID", by.y="ID")
Left Join
df4 = merge(df1, df2, by ="ID", all.x = TRUE)
Right Join
df5 = merge(df1, df2, by ="ID", all.y = TRUE)
Cross Join
df7 = merge(df1, df2, by = NULL)
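The join types can be compared on two small data frames. The frames left and right below are hypothetical stand-ins for df1 and df2, chosen so the row counts are easy to check by hand.

```r
# Hypothetical data frames sharing the key column ID
left  <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(ID = c(2, 3, 4), y = c(10, 20, 30))

inner <- merge(left, right, by = "ID")                 # IDs 2 and 3
lefts <- merge(left, right, by = "ID", all.x = TRUE)   # IDs 1, 2, 3; y is NA for ID 1
full  <- merge(left, right, by = "ID", all = TRUE)     # IDs 1, 2, 3, 4
cross <- merge(left, right, by = NULL)                 # every pairing: 3 x 3 = 9 rows

nrow(inner); nrow(lefts); nrow(full); nrow(cross)
```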
R : SUMMARIZE DATA
Sample Data
Examples
dat <- list( str='R', vec=c(1,2,3), bool=TRUE )
a = dat["str"]
a
class(a)
b = dat[["str"]]
b
class(b)
c = dat$str
c
class(c)
R Indexing Operators
Important Note
Both $ and [[ ]] work the same way. But it is advisable to use [[ ]] in functions and loops.
Sample Data
This tutorial explains how to convert data from wide to long format with R
programming.
Note : If you do not want to remove NA values, change na.rm = TRUE to na.rm =
FALSE.
In R, the which() function gives you the position of elements of a logical vector
that are TRUE.
Examples
1. which(letters=="z") returns 26.
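A few more cases in the same spirit, using a made-up numeric vector x:

```r
which(letters == "z")   # 26

x <- c(5, 12, 3, 20)
which(x > 4)            # 1 2 4, positions where the condition is TRUE
which.max(x)            # 4, position of the largest value
```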
In R, a file path must use forward slashes (or doubled backslashes). In Windows, file
paths are written with single backslashes. Converting them to forward slashes by hand is tedious.
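One way to do the conversion in code is gsub(). This is a sketch with a hypothetical path, and the string must already be typed with escaped (doubled) backslashes to be valid R.

```r
# A Windows path typed with escaped backslashes (hypothetical path)
win_path <- "C:\\Users\\me\\Documents\\data.csv"

# The regular expression "\\\\" matches one literal backslash
r_path <- gsub("\\\\", "/", win_path)
r_path   # "C:/Users/me/Documents/data.csv"
```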
In R, there is a package named mailR that allows you to send emails from R.
This tutorial explains how to run sql queries in R with sqldf package.
as.numeric(test1[[1]])
sqldf("select MAX(sale_date) from test1")
To measure execution time of R code, we can use Sys.time function. Put it before
and after the code and take difference of it to get the execution time of code.
start.time <- Sys.time()
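A complete version of the timing pattern might look like this. Sys.sleep() is only a placeholder for the code being timed, and the names end.time and time.taken are my own.

```r
start.time <- Sys.time()

Sys.sleep(0.2)                           # placeholder for the code being timed

end.time <- Sys.time()
time.taken <- end.time - start.time      # a difftime object
time.taken
as.numeric(time.taken, units = "secs")   # elapsed seconds as a plain number
```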
In R, you can convert multiple numeric variables to factor using lapply function.
The lapply function is a part of apply family of functions. They perform multiple
iterations (loops) in R. In R, categorical variables need to be set as factor variables.
Some of the numeric variables which are categorical in nature need to be
transformed to factor so that R treats them as a grouping variable.
In this case, we are converting two variables 'Credit' and 'Balance' to factor
variables.
names <- c('Credit' ,'Balance')
mydata[,names] <- lapply(mydata[,names] , factor)
str(mydata)
3. Converting all variables
In R, you can extract numeric and factor variables using sapply function.
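A minimal sketch of that extraction, on a hypothetical data frame:

```r
# Hypothetical data frame with numeric and factor columns
mydata <- data.frame(age = c(25, 31), income = c(100, 200),
                     city = factor(c("DL", "MU")))

num_vars <- mydata[, sapply(mydata, is.numeric), drop = FALSE]
fac_vars <- mydata[, sapply(mydata, is.factor), drop = FALSE]

names(num_vars)   # "age" "income"
names(fac_vars)   # "city"
```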
In R, you can install packages directly from GitHub with just 2-3 lines of code.
library(httr)
set_config(use_proxy(url="proxy.xxxx.com", port=80,
username="user",password="password"))
This tutorial explains how to read large CSV files with R. I have tested this code on
files of up to 6 GB.
The following code returns new dummy columns from a categorical variable.
DF <- data.frame(strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
for(level in unique(DF$strcol)){
DF[paste("strcol", level, sep = "_")] <- ifelse(DF$strcol == level, 1, 0)}
To have a different value against Y=1 and Y=0 for a categorical predictor, we can
adjust the average response value of the category,
Parameters of TransformCateg Function
1. y : Response or target or dependent variable - categorical or continuous
2. x : a list of independent variables or predictors - Factor or Character Variables
3. inputdata : name of input data frame
4. cutoff : minimum observations in a category. All the categories having
fewer observations than the cutoff are combined into a separate category.
In R, there is a package called caret, which stands for Classification And REgression
Training. It makes predictive modeling easy. It can run most predictive modeling
techniques with cross-validation. It can also perform data slicing and the pre-processing
steps of data modeling.
Explanation :
1. repeatedcv : repeated K-fold cross-validation
2. number = 10 : 10-fold cross-validation
3. repeats = 3 : three separate 10-fold cross-validations are used.
4. classProbs = TRUE : It should be TRUE if metric = "ROC" is used in the
train function. It can be skipped if metric = "Kappa" is used.
Note : Kappa measures accuracy adjusted for the agreement expected by chance.
1. nearZeroVar: a function to remove predictors that are sparse and highly
unbalanced
2. findCorrelation: a function to remove the optimal set of predictors to
achieve low pairwise correlations
3. preProcess: Variable selection using PCA
4. predictors: class for determining which predictors are included in the
prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)
5. confusionMatrix, sensitivity, specificity, posPredValue, negPredValue:
classes for assessing classifier performance
This article explains useful functions of the caret package in R. If you are new
to the caret package, check out the Part I tutorial.
Example
ctrl <- trainControl(method = "cv", classProbs = TRUE, summaryFunction =
twoClassSummary, number = 5)
Note : In the above syntax, "text mining" is a folder name. I have placed all text
files in this folder
docs<-Corpus(VectorSource(cname[,1]));
If imported multiple files from a folder
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)