
DAR Programming - An Approach to Data Analytics-1

The document provides a comprehensive guide on R programming, covering its basics, data types, data preparation, graphics, statistical analysis, and data mining. It includes instructions on installing R and R-Studio, writing simple programs, and utilizing various functions and mathematical operations. Additionally, it discusses the use of packages in R and offers insights into its flexibility and capabilities for data analytics.

Contents

Chapter 1 Basics of R
Chapter 2 Data Types in R
Chapter 3 Data Preparation
Chapter 4 Graphics using R
Chapter 5 Statistical Analysis Using R
Chapter 6 Data Mining Using R
Chapter 7 Case Studies
Glossary
Packages Used
Functions Used
References
Books
Websites
Index



CHAPTER 1

Basics of R

OBJECTIVES

On completion of this chapter you will be able to:

• understand how R is different from other languages
• install R on your system
• write a beginner's program using R
• get help in R
• assign variables in R
• know the basic mathematical operations in R
• understand the various environments and the scope of variables
• understand functions in R
• understand program flow control in R
• understand loops in R

1.1. Introducing R
R is a programming language, and the name also refers to the software used to
run R programs. Ross Ihaka and Robert Gentleman of the University of Auckland
created the R language in the 1990s. R is based on the S language, which John
Chambers developed at Bell Laboratories in the 1970s. The R software is a GNU
project and is free, open source software. R (the language and the software) is
developed by the R Core Team. Because its roots lie in S, R builds on a history
of development that began in the 1970s.
2 R Programming — An Approach for Data Analytics

One can write a new package in R if the existing packages are not sufficient
for one's needs. R is a high-level scripting language that is interpreted
rather than compiled. R is an imperative language that also supports
object-oriented programming.

R is a free, open source language with cross-platform compatibility. It is an
advanced statistical programming language that can produce high-quality
graphical output. R is flexible and comprehensive, yet accessible to beginners.
R interfaces with other languages and platforms such as C, C++, Java, Python
and Hadoop. R can handle large data sets in flat files, including data in
semi-structured or unstructured form.

The R language allows the user to program loops that analyse several data sets
in succession. It is also possible to combine different statistical functions
in a single program to perform more complex analyses. R users benefit from the
large number of programs written by others and available on the internet. At
first R can look very complex to a beginner or non-specialist, but a prominent
feature of R is its flexibility. R displays the results of an analysis
immediately, and these results are stored in “objects” so that further analysis
can be done on them. The user can also extract only the part of a result that
is of interest.

Looking at the features of R, some users may think, “I can’t write programs
using R”. This is not the case, for two reasons. First, R is an interpreted
language and not a compiled one. This means that commands typed at the keyboard
are executed directly, without the need to build a complete program as in C,
C++ or Java. Second, R’s syntax is simple and intuitive.

In R, a function is always called with parentheses, e.g. ls(). If only the name
of the function is typed, R displays the content (the source code) of the
function. In this book, function names are written followed by parentheses to
distinguish them from other objects. While R is running, variables, data,
functions, results, etc. are stored in the active memory of the computer in the
form of named objects. The user acts on these objects with operators and
functions.
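
As a small illustration of the paragraph above, the sketch below contrasts calling a function with merely naming it. The object names radius and area are made up for this example and do not come from the book.

```r
# Create two objects in the current workspace (names are illustrative).
radius <- 2
area <- pi * radius^2

# With parentheses the function is called: ls() returns the names of
# the objects currently stored in memory.
objects_now <- ls()
print(objects_now)

# Without parentheses the function is itself an object; is.function()
# confirms this without printing the whole definition.
is.function(ls)
```

Typing ls on its own at the prompt would print the definition of the function rather than the list of objects.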

1.2. Installing R
R is distributed in several forms: as source code, intended mainly for Unix and
Linux machines, and as pre-compiled binaries for Windows, Linux and Macintosh.
The files needed to install R, either from the source or from the pre-compiled
binaries, are distributed from the internet site of the Comprehensive R Archive
Network (CRAN), where the instructions for installation are also available.

R can be installed from http://www.r-project.org over an internet connection.
Use the “Download R” link on the web page to download the R executable. Choose
the version of R that is suitable for your operating system. R scripts can be
run from the R Console without installing the R-Studio IDE. The prerequisite
for installing R-Studio is that some version of R has already been downloaded
and installed. (Version 3.3.0 of R is used for installing and running the
scripts in this book.) Follow the instructions on the website to complete the
installation of the R Console.

Once the R installation is complete, we install R-Studio. To install R-Studio
on the Windows operating system, we download the latest precompiled binary
distribution from the R-Studio website http://www.rstudio.org. (Version 3.4 of
R-Studio is used for installing and running the scripts in this book.) Start
the installation and follow the steps required by the setup wizard. Once
completed, launch the RStudio IDE from Start → All Programs → RStudio →
RStudio.exe or from your custom installation directory. The default location of
the RStudio IDE is “C:\Program Files\RStudio\bin\rstudio.exe”.

R-Studio is an Integrated Development Environment (IDE) that consists of a GUI
with four parts: 1) a text editor, 2) a command-line interpreter, 3) a place to
display files, plots, packages and help information, and 4) a place to display
the data and the variables used in the program (Environment / History).

Figure 1.1 R-Studio GUI

1.3. Initiating R

1.3.1. First Program

Open the R GUI, find the command prompt, type the command below and press Enter
to run it.
> sum(1:5)
[1] 15

The output shows that the command gives the result 15. That is, the command has
taken the integers from 1 to 5 as input and has summed them. In the above
command sum() is a function that takes the argument 1:5, which is a vector
consisting of the sequence of integers from 1 to 5. Like other command prompts,
R allows the up arrow key to be used to recall previous commands.

1.3.2. Help in R

There are many ways to get help in R. If the name of a function or a dataset is
known, we can type ? followed by the name. If the name is not known, we need to
type ?? followed by a term related to the function being searched for.
Keywords, special characters and search phrases of two or more words need to be
enclosed in double or single quotes. The symbol # starts a comment in an R
program, as it does in several other scripting languages.
> ?mean # help page for mean function opens
> ?"+" # help page for addition function opens
> ?"if" # help page for if opens
> ??plotting # searches for the help pages containing the word "plotting"
> ??"regression model" # searches for "regression model" phrase

The same help can be obtained with the functions help() and help.search(). In
these functions the arguments have to be enclosed in quotes.
> help("mean")
> help("+")
> help("if")
> help.search("plotting")
> help.search("regression model")

1.3.3. Assigning Variables


The results of operations in R can be stored for reuse. Values are assigned to
variables using the symbol “<-” or “=”, of which “<-” is preferred. There is no
variable declaration in R; the type of a variable is determined by the value
assigned to it.
> X <- 1:3
> X
[1] 1 2 3
> Y = 4:6
> Y
[1] 4 5 6
> X + 3 * Y - 2
[1] 11 15 19

Variable names consist of letters, numbers, dots and underscores. A name must
start with a letter, or with a dot that is not followed by a number. Variable
names must not be reserved words. The symbol “<<-” assigns to a variable in an
enclosing environment; at the command prompt, or when no enclosing environment
already holds the variable, it creates a global variable (a variable available
everywhere).

X <<- exp(exp(1))
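
The following sketch, which is not from the original text, shows what the “<<-” symbol does when it is used inside a function: it looks for an existing variable in the enclosing environments and assigns there, falling back to the global environment only if none is found. The names make_counter, counter and count are hypothetical.

```r
make_counter <- function() {
  count <- 0                  # lives in the environment of this call
  function() {
    count <<- count + 1       # updates 'count' in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()
counter()
third <- counter()            # the counter remembers the two earlier calls
third
```

Because count is found in the enclosing function, no global variable named count is created here; a plain “<-” inside the inner function would instead create a new local variable on every call.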

Assignment can also be done using the assign() function. For global assignment
the same function can be used with the extra argument globalenv(). To see the
value of a variable, simply type its name at the command prompt. The same can
be done using the print() function.
> assign("F", 3 * 8)
> assign("G", 6 * 9, globalenv())
> F
[1] 24
> print(G)
[1] 54

If a value has to be assigned and printed in one line, this can be done in two
ways: by separating the two statements with a semicolon, or by wrapping the
assignment in parentheses (), as below.
> L <- sum(4:8); L
[1] 30
> (M <- sum(5:9))
[1] 35

1.3.4. Basic Mathematical Operations

The “+” (plus) operator performs addition. It can be used to add two numbers or
two vectors. A vector is an ordered set of values; vectors are used mainly to
analyse statistical data. The “:” (colon) operator creates a sequence, that is,
a series of numbers within the given limits. The c() function concatenates the
values given within its parentheses. Variable names in R are case sensitive.
Open the R GUI, find the command prompt, type the commands below and press
Enter to run them.
> 7:12 + 12:17
[1] 19 21 23 25 27 29
> c(3, 1, 8, 6, 7) + c(9, 2, 5, 7, 1)
[1] 12 3 13 13 8

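Vectors and the c() function help us to avoid loops in R. The statistical
functions in R can take vectors as input and produce results. The sum()
function also accepts the values as separate arguments and gives the same
result. The mean() and median() functions do not: mean() silently uses only its
first argument, and median() reported an error in the R version used for this
book (the exact behaviour depends on the R version).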
> sum(7:10)
[1] 34
> mean(7:10)
[1] 8.5
> median(7:10)
[1] 8.5
> sum(7,8,9,10)
[1] 34
> mean(7,8,9,10)
[1] 7
> median(7,8,9,10)
Error in median(7, 8, 9, 10) : unused arguments (9, 10)

Similar to the “+” operator, all other operators in R take vectors as inputs
and produce results. The subtraction and multiplication operations work as
below.
> c(5, 6, 1, 9) - 2
[1] 3 4 -1 7
> c(5, 6, 1, 9) - c(4, 2, 0, 7)
[1] 1 4 1 2
> -1:4 * -2:3
[1] 2 0 0 2 6 12
> -1:4 * 3
[1] -3 0 3 6 9 12

The exponentiation operator is represented using the symbol “^” or the “**”.
This can be checked using the function identical().
> identical(2^3, 2**3)
[1] TRUE

There are three division operators. Ordinary division is represented by the
“/” symbol, integer division by the “%/%” symbol and modulo division (the
remainder after integer division) by the “%%” symbol. The example commands
below show the results of the division operators.
> 5:9/2
[1] 2.5 3.0 3.5 4.0 4.5
> 5:9%/%2
[1] 2 3 3 4 4
> 5:9%%2
[1] 1 0 1 0 1

The other mathematical functions include the trigonometric functions sin(),
cos(), tan(), asin(), acos() and atan(), and the logarithmic and exponential
functions log(), exp(), log1p() and expm1(). All these mathematical functions
operate on vectors as well as on individual elements. A few more examples of
the mathematical functions are listed below.
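
The original examples did not survive in this copy, so the short sketch below is offered as a stand-in. It applies some of the functions named above to whole vectors; the variable names and input values are chosen only for illustration, and log10() is a further base R function not listed above.

```r
angles <- c(0, pi / 6, pi / 2)
sines <- sin(angles)              # element-wise sine of each angle
round(sines, 3)

values <- c(1, 10, 100)
log10(values)                     # base-10 logarithm of each element
exp(log(values))                  # exp() undoes log(), up to rounding error

# log1p(x) computes log(1 + x) and expm1(x) computes exp(x) - 1;
# both stay accurate when x is very close to zero.
tiny <- 1e-10
c(log1p(tiny), expm1(tiny))
```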

The operator “==” is used for comparing two values for equality. For checking
inequality the operator “!=” is used. These operators are called relational
operators. The relational operators also take vectors as input and operate on
them element by element. The other relational operators are “<”, “>”, “<=” and
“>=”.
> c(2, 4 - 2, 1 + 1) == 2
[1] TRUE TRUE TRUE
> 1:5 != 5:1
[1] TRUE TRUE FALSE TRUE TRUE
> exp(1:3) < 20
[1] TRUE TRUE FALSE
> (1:10) ^ 2 >= 50
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE

Non-integers should not be compared using the operator “==”, as it can produce
wrong results due to the rounding error of the floating-point numbers being
compared. To overcome this issue we have the function all.equal(), which
compares values within a small tolerance. If the values compared by all.equal()
are not equal, it returns a report on the difference. To get a TRUE or FALSE
reply, the call to all.equal() has to be wrapped in the function isTRUE(). The
example code below illustrates the concepts discussed.
> sqrt(2) ^ 2 == 2
[1] FALSE
> sqrt(2) ^ 2 - 2
[1] 4.440892e-16
> all.equal(sqrt(2) ^ 2, 2)
[1] TRUE
> all.equal(sqrt(2) ^ 2, 3)
[1] "Mean relative difference: 0.5"
> isTRUE(all.equal(sqrt(2) ^ 2, 3))
[1] FALSE

The equality operator “==” can also be used to compare strings, but string
comparison is case sensitive. Similarly, the operators “<” and “>” can be used
on strings; the ordering follows the collation rules of the current locale. The
examples below show the results.
> c("Week", "WEEK", "week", "weak") == "week"
[1] FALSE FALSE TRUE FALSE
> c("A", "B", "C") < "B"
[1] TRUE FALSE FALSE
> c("a", "b", "c") < "B"
[1] TRUE TRUE FALSE

1.4. Packages in R
R packages are hosted in an online repository called CRAN (Comprehensive R
Archive Network). A package is a collection of R functions and datasets. At the
time of writing, the CRAN package repository features 10756 available packages.
The list of all available packages in the CRAN repository can be viewed at
“https://cran.r-project.org/web/packages/available_packages_by_name.html”. To
find the list of functions available in a package (say the package “stats”) we
can use the command ls("package:stats") or the command library(help = stats) at
the command prompt.

A library is a folder on the machine that stores the files of packages. If a
package is already installed on a machine we can load it using the library()
function. The name of the package to be loaded is passed to the library()
function as an argument without enclosing it in quotes. If the package name has
to be passed to the library() function programmatically, we need to set the
argument character.only = TRUE. If the library() function is used to load a
package that is not installed, it throws an error. Alternatively, if the
require() function is used to load a package, it returns TRUE if the package
could be loaded and FALSE if the package is not installed.
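
The difference between the two loading functions can be sketched as follows. The package name "notARealPackage123" is invented purely to trigger the failure case, and the variable names pkg and loaded are illustrative.

```r
# Load a package whose name is held in a character variable.
pkg <- "stats"                        # ships with every R installation
library(pkg, character.only = TRUE)

# require() reports failure through its return value instead of an error.
# "notARealPackage123" is a made-up name used only to show that case.
loaded <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
loaded

if (!loaded) {
  message("Package not installed; library() would have stopped with an error.")
}
```

The return value of require() makes it convenient inside conditions, whereas library() is usually the safer choice at the top of a script because a missing package stops the run immediately.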

We can list all the packages that are already loaded using the search()
function. This list shows the global environment first, followed by the most
recently loaded packages. The last two are special environments, namely
“Autoloads” and the “base” package.
> search()
[1] ".GlobalEnv" "package:cluster" "tools:rstudio" "package:stats"
[5] "package:graphics" "package:grDevices" "package:utils" "package:datasets"
[9] "package:methods" "Autoloads" "package:base"

The function installed.packages() returns a matrix with information about all
the packages installed on a machine. It is best to view the result with the
View() function, as it may list hundreds of packages. The list also shows the
version of each package installed, its location on the machine and the packages
it depends on.
> View(installed.packages())

The function R.home("library") retrieves the location on the machine that
stores the default R packages. The same result can be obtained using the
.Library variable. The home directory can be listed using the functions
path.expand("~") and Sys.getenv("HOME").
> R.home("library")
[1] "C:/PROGRA~1/R/R-33~1.0/library"
> .Library
[1] "C:/PROGRA~1/R/R-33~1.0/library"
> path.expand("~")
[1] "C:/Users/admin/Documents"
> Sys.getenv("HOME")
[1] "C:/Users/admin/Documents"

When R is upgraded, it is usually necessary to reinstall the packages, as
different versions of R may need different versions of the packages. The
function .libPaths() lists all the R libraries on the machine. The first value
listed is the place where packages will be installed by default.
> .libPaths()
[1] "C:/Users/admin/Documents/R/win-library/3.3"
[2] "C:/Program Files/R/R-3.3.0/library"

Some packages are hosted outside the main CRAN repository and need special
attention. To access additional repositories, type setRepositories() and select
the repositories required. The repositories R-Forge and rforge.net contain the
development versions of packages that appear on the CRAN repository. The
function available.packages() lists the thousands of packages in each of the
selected repositories. (Note: use the View() function to browse the result
rather than printing thousands of packages at one go.)
> setRepositories()
--- Please select repositories for use in this session ---
1: + CRAN
2: BioC software
3: BioC annotation
4: BioC experiment
5: BioC extra
6: CRAN (extras)
7: Omegahat
8: R-Forge
9: rforge.net
10: + CRANextra
Enter one or more numbers separated by spaces, or an empty line to cancel
1:

There are many online repositories, such as GitHub, Bitbucket and Google Code,
from which R packages can be retrieved. Packages can be installed using the
install.packages() function by passing the name of the package as an argument.
An internet connection and write permission on the hard drive are needed to
install any package. To update the installed packages to their latest versions,
we use the function update.packages() with the argument ask = FALSE, which
suppresses the prompt before each package is updated. To delete a package that
is already installed, we use the function remove.packages(), passing the name
of the package to be removed as an argument.
> install.packages("chron")

1.5. Environments and Functions

1.5.1. Environments

In R the variables that we create need to be stored in an environment.
Environments are themselves another type of variable: we can assign them,
manipulate them and pass them as arguments to functions. They are like lists in
that they are used to store different types of variables. When a variable is
assigned at the command prompt, it goes by default into the global environment.
When a function is called, an environment is automatically created to store the
function-related variables. A new environment is created using the function
new.env().
> newenvironment <- new.env()

We can assign variables into a newly created environment using double square
brackets or the dollar operator, as below. The assign() function can also be
used to assign variables to an environment.
> newenvironment[["variable1"]] <- c(4, 7, 9)
> newenvironment$variable2 <- TRUE
> assign("variable3", "Value for variable3", newenvironment)

Retrieving values stored in an environment works like list indexing, or we can
use the get() function.
> newenvironment[["variable1"]]
[1] 4 7 9
> newenvironment$variable2
[1] TRUE
> get("variable3", newenvironment)
[1] "Value for variable3"

The functions ls() and ls.str() take an environment argument and list its
contents. We can test whether a variable exists in an environment using the
exists() function.
> ls(envir = newenvironment)
[1] "variable1" "variable2" "variable3"
> ls.str(envir = newenvironment)
variable1 : num [1:3] 4 7 9
variable2 : logi TRUE
variable3 : chr "Value for variable3"
> exists("variable2", newenvironment)
[1] TRUE

An environment can be converted into a list using the function as.list(), and a
list can be converted into an environment using the function as.environment()
or the function list2env().
> newlist <- as.list(newenvironment)
> newlist
$variable3
[1] "Value for variable3"
$variable1
[1] 4 7 9
$variable2
[1] TRUE
> as.environment(newlist)
<environment: 0x124730a8>
> list2env(newlist)
<environment: 0x12edf3e8>
> anotherenv <- as.environment(newlist)
> anotherenv[["variable3"]]
[1] "Value for variable3"

All environments are nested, so every environment has a parent environment. The
only exception is the empty environment, which sits at the top of the hierarchy
without any parent. The exists() and get() functions also look for variables in
the parent environments. To change this behaviour we need to pass the argument
inherits = FALSE.
> subenv <- new.env(parent = newenvironment)
> exists("variable1", subenv)
[1] TRUE
> exists("variable1", subenv, inherits = FALSE)
[1] FALSE

The word frame is often used interchangeably with the word environment. The
function parent.frame() returns the environment from which a function was
called, whereas parent.env() returns the parent (enclosing) environment of a
given environment. The variables assigned from the command prompt are stored in
the global environment. The functions and the variables from R’s base package
are stored in the base environment.
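
The sketch below is an illustration added here, with invented names (outer_env, inner_env, who_called, caller). It shows the two notions side by side: the enclosing environment fixed when an environment is created, and the calling environment seen by parent.frame().

```r
outer_env <- new.env()
inner_env <- new.env(parent = outer_env)

# parent.env() returns the enclosing environment given at creation.
identical(parent.env(inner_env), outer_env)

# parent.frame() returns the environment a function was called from.
who_called <- function() parent.frame()
caller <- function() {
  here <- environment()          # the environment of this call to caller()
  identical(who_called(), here)
}
called_from_caller <- caller()
called_from_caller
```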

1.5.2. Functions
A function together with its environment is called a closure. When we load a
package, the functions in that package are stored in an environment on the
search path. A function in R is a verb and not a noun, as it does things with
its data. Functions are also just another data type, so we can assign them,
manipulate them and pass them as arguments to other functions. Typing the
function name at the command prompt lists the code associated with the
function. Below is the code listed for the function readLines().
> readLines
function (con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown",
    skipNul = FALSE)
{
    if (is.character(con)) {
        con <- file(con, "r")
        on.exit(close(con))
    }
    .Internal(readLines(con, n, ok, warn, encoding, skipNul))
}

When we call a function by passing values to it, the values are called
arguments. The lines of code between the curly braces form the body of the
function. In R, an explicit return statement is not required: the last value
calculated in a function is returned by default. The return() function can
still be used to leave a function early.
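
A minimal sketch of the two styles of returning a value is given below; the function names add_tax and safe_divide and the default rate are assumptions made for the example.

```r
# Implicit return: the value of the last expression is the result.
add_tax <- function(price, rate = 0.1) {
  price * (1 + rate)
}

# Explicit return(): leaves the function early for a special case.
safe_divide <- function(x, y) {
  if (y == 0) {
    return(NA)
  }
  x / y
}

add_tax(100)
safe_divide(10, 4)
safe_divide(10, 0)
```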

To create a user-defined function, we just assign the function to a name as we
do for other variables. The code below is an example of how to create a
user-defined function. Here cube is the name of the function and x is the
argument passed to it. The content within the curly braces is the body of the
function. (Note: if the body is a single line of code we can omit the curly
braces.) Once a function is defined, it can be called like any other function
in R by passing its arguments. Because the last statement of cube() is an
assignment, its value is returned invisibly, so the result is stored in z and
then printed.
> cube <- function(x)
+ { cu <- x ^ 3 }
> z <- cube(5)
> z
[1] 125

The functions formals(), args() and formalArgs() fetch the arguments defined
for a function. The body of the function can be retrieved using the body() and
deparse() functions.
> formals(cube)
$x
> args(cube)
function (x)
NULL
> formalArgs(cube)
[1] "x"
> body(cube)
{
    cu <- x^3
}
> deparse(cube)
[1] "function (x) " "{" "    cu <- x^3" "}"

Functions can be passed as arguments to other functions and they can be
returned from other functions. For calling a function there is another function
called do.call(), to which we pass the function and a list of its arguments.
The use of this function can be seen below, where rbind() is used to
concatenate two data frames.
> f1 <- data.frame(x = 1:4, y = 5:8)
> f2 <- data.frame(x = 9:12, y = 13:16)
> do.call(rbind, list(f1, f2))
   x  y
1  1  5
2  2  6
3  3  7
4  4  8
5  9 13
6 10 14
7 11 15
8 12 16

When using functions as arguments to the do.call() function, it is not
necessary to assign them first. We can pass a function anonymously as below.
> do.call(function(x,y) x * y, list(1:3, 4:6))
[1] 4 10 18
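
As a further hedged illustration of do.call(), the list of arguments may also contain named elements, and the function may be given by name as a character string. The variable names args_list, joined and total are chosen for this sketch only.

```r
# Positional and named arguments can be mixed in the list.
args_list <- list("data", "frame", sep = "_")
joined <- do.call(paste, args_list)   # same as paste("data", "frame", sep = "_")
joined

# The function may also be given by name as a character string.
total <- do.call("sum", list(1, 2, 3))
total
```

This is useful when the arguments are only known at run time, for example when they have been collected into a list by earlier code.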

1.5.3. Variable Scope


A variable’s scope is the set of places from which the variable can be seen. If
a variable is defined within a function, the variable can be accessed from any
statement in that function. Sub-functions defined inside the function also have
access to the variables defined in the parent function.
> x <- function(a1)
+ {
+   a2 <- 1
+   y <- function(a1)
+   {
+     a2 / a1
+   }
+   y(a1)
+ }
> x(5)
[1] 0.2

Thus R searches for a variable in the current environment and, if it cannot
find it there, looks in the parent environment. This search proceeds upwards
through the enclosing environments until it reaches the global environment and
the loaded packages beyond it. The variables defined in the global environment
are called global variables and can be accessed from anywhere else. The
replicate() function can be used to run a function several times, as below.
Here the user-defined function random() returns 1 if the value returned by the
rnorm() function is positive; otherwise it returns the value of the argument
passed to random(). The function random() is called 20 times using the
replicate() function, and the output varies from run to run because rnorm()
draws random numbers.
> random <- function(x)
+ {
+   if(rnorm(1) > 0)
+   {r <- 1}
+   else
+   {r <- x}
+ }
> replicate(20, random(5))
[1] 5 5 1 1 5 1 5 5 5 5 5 5 5 5 5 1 1 5 1 5
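
The lookup rule described above can be sketched with a variable defined outside a function. The names rate, uses_outer and uses_local are hypothetical and serve only to show where R finds each value.

```r
rate <- 10                       # defined outside the functions

uses_outer <- function(v) {
  v * rate                       # 'rate' is not local, so R looks outward
}

uses_local <- function(v) {
  rate <- 2                      # local variable shadows the outer one
  v * rate
}

uses_outer(3)
uses_local(3)
rate                             # the outer variable is unchanged
```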

1.6. Flow Control


In some situations it may be required to execute some code only if a condition is
satisfied.

1.6.1. If and Else Statement

The if statement takes a logical value and executes the next statement only if
the value is TRUE.
> if(TRUE) message("TRUE Statement")
TRUE Statement
> if(FALSE) message("FALSE Statement")

It is not necessary to pass the logical values TRUE or FALSE directly; instead,
a variable or expression that returns a logical value can be used. If there are
several statements to execute after the condition, they can be wrapped in curly
braces.
a <- 5
if(a < 7)
{
  b <- a * 5
  c <- b * 3
  message("b is ", b)
  message("c is ", c)
}
b is 25
c is 75

In the if and else construct, the code that follows the if statement is
executed if the condition is TRUE and the code that follows the else statement
is executed if the condition is FALSE. It is important to note that the else
statement must occur on the same line as the closing curly brace of the if
statement; otherwise R throws an error.
a <- 8
if(a < 7)
{
  b <- a * 5
  c <- b * 3
  message("b is ", b)
  message("c is ", c)
} else
{
  message("a is greater than 7")
}
a is greater than 7

The if and else statements can be chained to code multiple conditions and their
respective actions. In this case it is important to note that if and else are
separate words and are not written as the single word ifelse. The ifelse()
function has a different use, which is covered shortly.
a <- -8
if(a < 0)
{
  message("a is negative")
} else if(a == 0)
{
  message("a is zero")
} else if(a > 0)
{
  message("a is positive")
}
a is negative

The ifelse() function takes three arguments: the first is a logical condition,
which may be a vector; the second is the value returned where the condition is
TRUE; and the third is the value returned where the condition is FALSE.
> a <- 3
> b <- 5
> ifelse(a < b, "a is less than b", "a is greater than b")
[1] "a is less than b"
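
Because ifelse() works element by element, it is most useful on vectors. The following sketch uses an invented vector of marks and an assumed pass mark of 50.

```r
marks <- c(35, 72, 50, 18)

# One result per element of the condition vector.
result <- ifelse(marks >= 50, "pass", "fail")
result
```

A plain if statement, by contrast, expects a single logical value and is not suited to testing a whole vector at once.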

1.6.2. Switch Statement

If there are many else statements the code becomes confusing, and in such cases
the switch() function is useful. The first argument of switch() is an
expression that returns a string or an integer. This is followed by several
named arguments that provide the results when the name matches the value of the
first argument. Here also we can execute multiple statements enclosed in curly
braces. If there is no match, switch() returns NULL invisibly, so nothing is
printed. It is therefore safer to supply a default value, as an unnamed last
argument, for the case when none matches. When the first argument is an
integer, the result is selected by position.
> switch("color", "color" = "red", "shape" = "circle", "radius" = 10)
[1] "red"
> switch("position", "color" = "red", "shape" = "circle", "radius" = 10)
> switch("position", "color" = "red", "shape" = "circle", "radius" = 10, "default")
[1] "default"
> switch(2, "red", "green", "blue")
[1] "green"
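
The sketch below, with the hypothetical function shape_area(), combines the points made above: a named branch with several statements in curly braces, and an unnamed last argument acting as the default. Stopping with an error in the default branch is a design choice of this example, not something the text prescribes.

```r
shape_area <- function(shape, size) {
  switch(shape,
    "square" = size^2,
    "circle" = {
      radius <- size / 2            # several statements inside curly braces
      pi * radius^2
    },
    stop("unknown shape: ", shape)  # unnamed last argument is the default
  )
}

shape_area("square", 3)
shape_area("circle", 2)
```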

1.7. Loops
There are three kinds of loops in R namely, repeat, while and for.

1.7.1. Repeat Loops

The repeat loop is the simplest loop in R: it executes the same code until it
is forced to stop. It is similar to the do while statement in other languages.
A break statement is used when it is required to leave the loop. It is also
possible to skip the rest of the statements in the loop body and move on to the
next iteration, and this is done using the next statement.
a <- 1
repeat {
  message("Inside the loop")
  if(a == 3)
  {
    a <- a + 1
    next
  }
  message("The value of a is ", a)
  a <- a + 1
  if(a > 5)
  {
    message("Exiting the loop")
    break
  }
}
Inside the loop
The value of a is 1
Inside the loop
The value of a is 2
Inside the loop
Inside the loop
The value of a is 4
Inside the loop
The value of a is 5
Exiting the loop

1.7.2. While Loops

The while loops are backward repeat loops. The repeat loop executes the code
and then checks the condition, but in a while loop the condition is checked
first and then the code is executed. So it is possible that the code is not
executed even once, when the condition already fails on entry before the first
iteration. The same example as above can be written using the while statement.
a <- 1
while (a <= 5)
{
  message("Inside the loop")
  if(a == 3)
  {
    a <- a + 1
    next
  }
  message("The value of a is ", a)
  a <- a + 1
}
Inside the loop
The value of a is 1
Inside the loop
The value of a is 2
Inside the loop
Inside the loop
The value of a is 4
Inside the loop
The value of a is 5
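
The difference in when the condition is tested can be made visible by counting iterations. In this sketch the starting value n <- 10 and the counter names are assumptions chosen so that the condition n < 5 is already FALSE on entry.

```r
n <- 10

while_runs <- 0
while (n < 5) {                 # condition is FALSE at entry, body never runs
  while_runs <- while_runs + 1
}

repeat_runs <- 0
repeat {                        # body runs first, condition is tested afterwards
  repeat_runs <- repeat_runs + 1
  if (!(n < 5)) break
}

c(while_runs = while_runs, repeat_runs = repeat_runs)
```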

1.7.3. For Loops

The for loops are used when we know how many times the code needs to be
repeated. The for loop accepts an iterating variable and a vector. It repeats
the loop, giving the iterating variable each element of the vector in turn. In
this case also, if there are multiple statements to execute, we can use curly
braces. The vector being iterated over can be an integer, numeric, character or
logical vector, and it can even be a list.
for(i in 1:5)
{
  j <- i * i
  message("The square value of ", i, " is ", j)
}
The square value of 1 is 1
The square value of 2 is 4
The square value of 3 is 9
The square value of 4 is 16
The square value of 5 is 25

for(i in c(TRUE, FALSE, NA))
{
  message("This Statement is ", i)
}
This Statement is TRUE
This Statement is FALSE
This Statement is NA

a <- c(1,2,3)
b <- c("a","b","c","d")
d <- c(TRUE, FALSE)
l <- list(a, b, d)
for(i in l)
{
  message("The value of the list is ", i)
}
The value of the list is 123
The value of the list is abcd
The value of the list is TRUEFALSE
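
Two common variations on the for loop are sketched below as an addition to the text: looping over positions with seq_along(), and replacing a simple loop by a vectorised expression, in line with the earlier remark that vectors help to avoid loops. The vector temps and its values are invented for the example.

```r
temps <- c(mon = 21.5, tue = 19.0, wed = 23.2)

# Iterate over positions; seq_along() is safe even for an empty vector.
for (i in seq_along(temps)) {
  message(names(temps)[i], ": ", temps[i])
}

# The squares from the first for loop above, computed without a loop.
squares <- (1:5)^2
squares
```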

HIGHLIGHTS
• R is a free, open source language with cross-platform compatibility.
• R’s syntax is simple and intuitive.
• R’s installation software can be downloaded from the CRAN website.
• Help in R can be obtained by using, e.g., ?mean or help("mean").
• Variables can be assigned using the symbol “<-” or the assign() function.
• The basic functions are c(), sum(), mean(), median(), exp(), sqrt() etc.
• The basic operators are “+”, “-”, “*”, “/”, “:”, “^”, “**”, “%%”, “%/%”,
  “==”, “!=”, “<”, “>”, “<=”, “>=” etc.
• At the time of writing, the CRAN package repository features 10756 available
  packages.
• A package can be newly installed using the function install.packages() and
  it can be loaded using the function library().
• When a variable is assigned at the command prompt, it goes by default into
  the global environment.
• To create a new environment we use the function new.env().
• Typing the function name at the command prompt lists the code associated
  with the function.
• The if and the else statements are separate words and are not written as the
  single word ifelse.
• The ifelse() function takes three arguments.
• If there are many else statements, the switch() function is useful.
• The repeat loop is similar to the do while statement in other languages.
• A break statement is used when it is required to leave a loop.
• To skip the rest of the statements in a loop we use the next statement.
• The while loops are backward repeat loops.
• The for loops are used when we know how many times the code needs to be
  repeated.
CHAPTER 2

Data Types in R

OBJECTIVES

On completion of this Chapter you will be able to:

• know the basic data types in R from which the more complex data types are
built
• know how to create, access and perform basic operations on the vector
data types in R
• know how to create, access and perform basic operations on matrices and
arrays in R
• know how to create, access and perform basic operations on the list data
types
• know how to create, access and perform basic operations on the factor
data types in R
• know how to create, access and perform basic operations on strings in R
• understand the various date and time classes in R
• convert between various date formats
• set up various time zones
• perform calculations on dates and times

2.1. Basic Data Types in R


In contrast to other programming languages like C and Java, variables in R are
not declared with a data type. A variable is assigned an R-object, and the data type
of the R-object becomes the data type of the variable. There are many types of
R-objects. The frequently used ones are Vectors, Arrays, Matrices, Lists,
Data Frames, Strings and Factors.

The simplest of these objects is the Vector object and there are six data types
of these atomic vectors, also termed the six classes of vectors. The other R-objects
are built upon the atomic vectors. The five atomic types used most often, and
covered here, are Numeric, Integer, Complex, Logical and Character; the sixth,
raw, stores raw bytes.

2.1.1. Numeric

Decimal values are called numeric in R. It is the default computational data type. If
we assign a decimal value to a variable x as follows, x will be of numeric type.
> x = 10.5       
> x        
[1] 10.5
> class(x)       # print the class name of x
[1] “numeric”

Furthermore, even if we assign a whole number to a variable k, it is still saved as
a numeric value. Whether k is an integer can be checked with the is.integer()
function.
> k = 1
> k
[1] 1
> class(k)
[1] “numeric”
> is.integer(k)
[1] FALSE

2.1.2. Integer

In order to create an integer variable in R, the as.integer() function is invoked as
below.
> y = as.integer(3)
> y
[1] 3
> class(y)
[1] “integer”
> is.integer(y)
[1] TRUE

We can force a numeric value into an integer with the same as.integer() function
as below.
> as.integer(3.14)
[1] 3

Similarly, we can parse a string containing a decimal value as below.
> as.integer("5.27") # force a decimal string
[1] 5

But if a string that does not hold a number is coerced, R returns NA and issues a
warning rather than an error.
> as.integer("abc")
[1] NA
Warning message:
NAs introduced by coercion

The integer values of the logical values TRUE and FALSE are 1 and 0 respectively.
> as.integer(TRUE)
[1] 1
> as.integer(FALSE)
[1] 0
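
The behaviour of as.integer() described above can be sketched a little further. The example below assumes nothing beyond base R; the L suffix for integer literals is not mentioned in the text above and is added here for illustration.

```r
# The L suffix creates an integer literal directly, without a conversion call
n <- 3L
class(n)           # "integer"
is.integer(n)      # TRUE

# as.integer() truncates towards zero; it does not round
pos <- as.integer(3.99)    # 3
neg <- as.integer(-3.99)   # -3

# Round first if the nearest whole number is what is wanted
nearest <- as.integer(round(3.99))   # 4
```

Truncation towards zero is why as.integer(3.14) gives 3 in the example above; a value such as 3.99 also gives 3.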

2.1.3. Complex

A complex value in R is written using the imaginary unit i.
> z = 3 + 4i
> z
[1] 3+4i
> class(z)
[1] “complex”

The square root of -1 is not defined for numeric values, so sqrt(-1) returns NaN
with a warning. But if -1 is first converted into a complex number and then the
square root is applied, it produces the necessary result as another complex number.
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> sqrt(as.complex(-1))
[1] 0+1i
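
As a short illustration of working with complex values, the base R functions Re(), Im(), Mod() and Conj() extract the parts of a complex number. This is only a sketch using the value z from the example above.

```r
z <- 3 + 4i

real_part <- Re(z)     # 3
imag_part <- Im(z)     # 4
modulus   <- Mod(z)    # 5, since sqrt(3^2 + 4^2) = 5
conjugate <- Conj(z)   # 3-4i
```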

2.1.4. Logical

Logical values are created when two values are compared. The logical
operators are “&” (and), “|” (or), and “!” (negation).
> a = 4; b = 7
> p = a > b
> p
[1] FALSE
> class(p)
[1] "logical"
> a = TRUE; b = FALSE
> a & b
[1] FALSE
> a | b
[1] TRUE
> !a
[1] FALSE

2.1.5. Character

The character object is used to represent string values in R. Objects can be converted
into character values using the as.character() function. A paste() function can be
used to concatenate two character values.
> s = as.character(“7.48”)
>s
[1] “7.48”
> class(s)
[1] “character”
> fname = “Adam”
> lname = “Smith”
> paste(fname, lname)
[1] “Adam Smith”

However, a formatted string can be created using the sprintf() function, whose
syntax is similar to that of the C language.
> sprintf("%s has %d rupees", "Sundar", 1000)
[1] "Sundar has 1000 rupees"

The substr() function can be used to extract a substring from a given string. The
sub() function is used to replace the first occurrence of a string with another string
as below.
> substr(“Twinkle Twinkle Little Star”, start = 9, stop = 15)
[1] “Twinkle”
> sub(“Twinkle”, “Wrinkle”, “Twinkle Twinkle Little Star”)
[1] “Wrinkle Twinkle Little Star”
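
Since sub() replaces only the first occurrence, the following sketch contrasts it with gsub(), which replaces every occurrence, and shows a few more sprintf() format codes. The name "Asha" and the marks are made-up values for illustration.

```r
rhyme <- "Twinkle Twinkle Little Star"

first_only   <- sub("Twinkle", "Wrinkle", rhyme)    # first match only
all_replaced <- gsub("Twinkle", "Wrinkle", rhyme)   # every match

# Format codes: %s string, %d whole number, %.2f two decimals, %% a literal percent sign
line <- sprintf("%s scored %d marks (%.2f%%)", "Asha", 45L, 45 / 50 * 100)
```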

2.2. Vectors
A sequence of data elements of the same basic type is called a Vector. The elements
of a vector are called its components or members. The vector() function creates a vector
of a specified type and length. Each element is initialised to zero, FALSE or the empty
string, depending on the type.
> vector(“numeric”, 3)
[1] 0 0 0
> vector(“logical”, 5)
[1] FALSE FALSE FALSE FALSE FALSE
> vector(“character”, 2)
[1] “” “”

The commands below produce the same results as the commands above.
> numeric(3)
[1] 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(2)
[1] “” “”

The seq() function generates sequences. The function seq.int() likewise
creates a sequence from one number to another, and an optional third argument
sets the step between successive values.
> seq(1:5)
[1] 1 2 3 4 5
> seq.int(5, 12)
[1] 5 6 7 8 9 10 11 12
> seq.int(10, 5, -1.5)
[1] 10.0 8.5 7.0 5.5

The function seq_len() creates a sequence from 1 to the input value. The
function seq_along() creates a sequence from 1 to the length of the input.
> seq_len(7)
[1] 1 2 3 4 5 6 7
> p <- c(3, 4, 5, 6)
> seq_along(p)
[1] 1 2 3 4
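
The sequence functions above can be sketched with named arguments as follows. The by and length.out arguments are standard arguments of seq(); the empty vector is a made-up case that shows why seq_along() is often preferred over 1:length(x) in loops.

```r
# Step size and output length can be given by name
quarters <- seq(0, 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
thirds   <- seq(0, 1, length.out = 3)   # 0.0 0.5 1.0

# seq_along() is safe for empty vectors, unlike 1:length(x)
empty  <- numeric(0)
safe   <- seq_along(empty)    # integer(0): a loop over it runs zero times
unsafe <- 1:length(empty)     # 1 0: a loop over it would run twice
```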

The function length() can be used to find the length of a vector, that is the
number of elements in the vector. It can also be used to assign a new length to a
vector: shortening drops the trailing elements, and extending pads the end with NA values.
> length(1:7)
[1] 7
> length(c(“aa”, “ccc”, “eeee”))
[1] 3
> nchar(c(“aa”, “ccc”, “eeee”))
[1] 2 3 4
> s <- c(1,2,3,4,5)
> length(s) <- 3
>s
[1] 1 2 3
> length(s) <- 8
>s
[1] 1 2 3 NA NA NA NA NA

Each element of a vector can be given a name when the vector is created. If a name
contains spaces or special characters, it must be enclosed in quotes.
The names() function can be used to give names to the vector elements after
creation.
> c(a = 1, b = 2, c = 3)
a b c
1 2 3
> s <- 1:3
> s
[1] 1 2 3
> names(s) <- c("a", "b", "c")
> s
a b c
1 2 3

Elements of a vector can be accessed using their indexes, which are specified in
square brackets. The index starts from 1 and not 0. A negative index returns all
the elements except the one specified. The name of a vector element can also be
used as the index to fetch it.
> x <- c(1:5)
>x
[1] 1 2 3 4 5
> x[c(2,3)]
[1] 2 3
> x[c(-1,-4)]
[1] 2 3 5
> s <- 1:3
>s
[1] 1 2 3
> names(s) <- c(“a”, “b”, “c”)
> s[“b”]
b
2

If an index beyond the length of the vector is specified, the result is NA.
Non-integer indices are truncated towards zero, not rounded. Not passing any index
to a vector returns all the elements of the vector.
>x
[1] 1 2 3 4 5
> x[7]
[1] NA

The which() function returns the positions (indices) of the elements that satisfy
the condition given to it. The functions which.min() and which.max() return the
positions of the minimum and the maximum elements in the vector. In the vector
below the positions happen to coincide with the values.
>x
[1] 1 2 3 4 5
> which.min(x)
[1] 1
> which.max(x)
[1] 5
> which(x>3)
[1] 4 5
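
Because the vector x above holds the values 1 to 5, positions and values look the same. A small sketch with made-up data (the vector scores is hypothetical) makes the difference visible.

```r
scores <- c(10, 40, 20, 50)   # hypothetical data

positions <- which(scores > 15)   # positions 2 3 4
values    <- scores[positions]    # values 40 20 50

top_position <- which.max(scores)      # position 4
top_value    <- scores[top_position]   # value 50
```

To obtain the matching values rather than their positions, the result of which() is used as an index into the vector.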

Vectors can be combined using the c() function. When a numeric vector and a
character vector are combined, the numeric values are coerced into character values,
because all the members of a vector must be of the same basic data type.
> f = c(7, 5, 9)
> g = c(“aaa”, “bbb”, “ccc”)
> c(f, g)
[1] “7” “5” “9” “aaa” “bbb” “ccc”

Arithmetic operations on vectors are performed member-wise. If two vectors
are of unequal length, the shorter vector is recycled in order to match the
longer vector.
> x = c(5, 8, 9)
> y = c(2, 6, 9)
> 4 * y
[1] 8 24 36
> x + y
[1] 7 14 18
> x - y
[1] 3 2 0
> x * y
[1] 10 48 81
> x / y
[1] 2.500000 1.333333 1.000000
> v = c(1, 2, 3, 4, 5, 6)
> x + v
[1] 6 10 12 9 13 15
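
In the example above the longer length (6) is a multiple of the shorter one (3), so recycling is silent. The sketch below, with a made-up vector w, shows what happens otherwise: R still recycles, but it issues a warning.

```r
x <- c(5, 8, 9)
w <- c(1, 2)   # hypothetical vector; length 2 does not divide length 3

# x is paired with 1, 2, 1; a result is returned together with a warning
warned <- tryCatch({ x + w; FALSE }, warning = function(cond) TRUE)
result <- suppressWarnings(x + w)   # 6 10 10
```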

The rep() function creates a vector with repeated elements. This function has
its other variants such as rep.int() and rep_len() whose usage is as given below.
> rep(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep(1:3, each = 4)
[1] 1 1 1 1 2 2 2 2 3 3 3 3
> rep(1:3, times = 1:3)
[1] 1 2 2 3 3 3
> rep(1:3, length.out = 9)
[1] 1 2 3 1 2 3 1 2 3
> rep.int(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep_len(1:3, 9)
[1] 1 2 3 1 2 3 1 2 3

2.3. Matrices and Arrays


A matrix is a collection of data elements with the same basic type arranged in a two-
dimensional rectangular layout. An array consists of multidimensional rectangular
data. Matrices are special cases of two-dimensional arrays. To create an array the
array() function can be used and a vector of values and vector of dimensions are
passed to it.
> x <- array(1:24, dim = c(4, 3, 2),
dimnames = list(c("a", "b", "c", "d"), c("e", "f", "g"), c("h", "i")))
> x
, , h
  e f  g
a 1 5  9
b 2 6 10
c 3 7 11
d 4 8 12
, , i
   e  f  g
a 13 17 21
b 14 18 22
c 15 19 23
d 16 20 24

Matrices are created with the matrix() function, passing the nrow or ncol
argument instead of the dim argument used for arrays. A matrix can also
be created using the array() function with a dimension vector of length two.
> m <- matrix(1:12, nrow = 3, dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
> m
  d e f  g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
> m1 <- array(1:12, dim = c(3,4),
dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
> m1
  d e f  g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12

The argument byrow = TRUE in the matrix() function assigns the elements
row wise. If this argument is not specified, by default the elements are filled column
wise.
> m <- matrix(1:12, nrow = 3, byrow = TRUE,
dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))

The dim() function returns the dimensions of an array or a matrix. The functions
nrow() and ncol() return the number of rows and the number of columns of a matrix
respectively.
> dim(x)
[1] 4 3 2
> dim(m)
[1] 3 4
> nrow(m)
[1] 3
> ncol(m)
[1] 4

The length() function also works for matrices and arrays. It is also possible to
assign new dimension for a matrix or an array using the dim() function.
> length(x)
[1] 24
> length(m)
[1] 12
> dim(m) <- c(6,2)

The functions rownames(), colnames() and dimnames() can be used to fetch the
row names, column names and dimension names of matrices and arrays respectively.
> rownames(m1)
[1] “a” “b” “c”
> colnames(m1)
[1] “d” “e” “f ” “g”
> dimnames(x)
[[1]]
[1] “a” “b” “c” “d”
[[2]]
[1] “e” “f ” “g”
[[3]]
[1] “h” “i”

It is possible to extract the element at the nth row and mth column using the
expression M[n, m]. The entire nth row can be extracted using M[n, ] and similarly,
the mth column can be extracted using M[,m]. Also, it is possible to extract more
than one column or row.
> M <- matrix(1:9, nrow = 3, byrow = TRUE)
> M[2,3]
[1] 6
> M[2,]
[1] 4 5 6
> M[,3]
[1] 3 6 9
> M[,c(1,3)]
[,1] [,2]
[1,] 1 3
[2,] 4 6
[3,] 7 9
> M[c(1,3),]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 7 8 9
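
Extracting a single row or column, as in M[2, ] above, returns a plain vector. The sketch below shows the standard drop = FALSE argument, which keeps the result as a matrix; the matrix M is assumed to hold the values 1 to 9 filled row-wise, which matches the outputs shown above.

```r
M <- matrix(1:9, nrow = 3, byrow = TRUE)

row_vec <- M[2, ]                 # plain vector: 4 5 6
row_mat <- M[2, , drop = FALSE]   # still a matrix with 1 row and 3 columns

dim(row_vec)   # NULL
dim(row_mat)   # 1 3
```

Keeping the matrix shape matters when the result is passed on to functions such as %*% or rbind() that depend on dimensions.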

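The transpose of a matrix, obtained by interchanging its rows and columns, is
computed using the function t(). Row and column names are assigned first here so
that the interchange is visible in the output.
> dimnames(M) <- list(c("r1", "r2", "r3"), c("c1", "c2", "c3"))
> t(M)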
r1 r2 r3
c1 1 4 7
c2 2 5 8
c3 3 6 9

The columns of two matrices can be combined using the cbind() function and
similarly the rows of two matrices can be combined using the rbind() function.
> M1 = matrix(c(2,4,6,8,10,12), nrow=3, ncol=2)
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9), nrow=3, ncol = 1)
> M2
[,1]
[1,] 3
[2,] 6
[3,] 9
> cbind(M1, M2)
[,1] [,2] [,3]
[1,] 2 8 3
[2,] 4 10 6
[3,] 6 12 9
> M3 = matrix(c(4,8), nrow=1, ncol=2)
> M3
[,1] [,2]
[1,] 4 8
> rbind(M1, M3)
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
[4,] 4 8

A matrix can be deconstructed using the c() function which combines all
column vectors into one.
> c(M1)
[1] 2 4 6 8 10 12

The arithmetic operators “+”, “-”, “*”, “/” work element-wise on matrices
and arrays, provided the matrices or arrays are of conformable
sizes. Matrix multiplication is done using the operator “%*%”.
> M1 = matrix(c(2,4,6,8,10,12), nrow=3, ncol=2)
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9,11,1,5), nrow=3, ncol = 2)
> M2
[,1] [,2]
[1,] 3 11
[2,] 6 1
[3,] 9 5
> M1 + M2
[,1] [,2]
[1,] 5 19
[2,] 10 11
[3,] 15 17
> M1 * M2
[,1] [,2]
[1,] 6 88
[2,] 24 10
[3,] 54 60
> M2 = matrix(c(3,6,9,11), nrow=2, ncol = 2)
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M1 %*% M2
[,1] [,2]
[1,] 54 106
[2,] 72 146
[3,] 90 186

The power operator “^” also works element wise on matrices. To find the
inverse of a matrix the function solve() can be used.
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M2^-1
[,1] [,2]
[1,] 0.3333333 0.11111111
[2,] 0.1666667 0.09090909
> solve(M2)
[,1] [,2]
[1,] -0.5238095 0.4285714
[2,] 0.2857143 -0.1428571
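
One way to check an inverse is to multiply it by the original matrix and compare the product with the identity matrix. The sketch below does this for M2; the right-hand side b is a made-up vector used to show that solve() can also solve a linear system directly.

```r
M2 <- matrix(c(3, 6, 9, 11), nrow = 2, ncol = 2)
inv <- solve(M2)

# A matrix times its inverse should be the identity, up to rounding error
product <- M2 %*% inv
is_identity <- isTRUE(all.equal(product, diag(2)))

# solve(A, b) solves A %*% x = b without forming the inverse explicitly
b <- c(12, 17)     # hypothetical right-hand side
x <- solve(M2, b)  # here x is c(1, 1), since 3 + 9 = 12 and 6 + 11 = 17
```

all.equal() is used instead of == because floating-point arithmetic rarely reproduces the identity matrix exactly.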

2.4. Lists
Lists allow us to combine different data types in a single variable. Lists are
created using the list() function, which is used like the c() function: the
contents of the list are passed as arguments separated by commas. List elements
can be vectors, matrices or even functions. The elements of a list can be named
at creation or later using the names() function.
> L <- list(c(9,1, 4, 7, 0), matrix(c(1,2,3,4,5,6), nrow = 3))
>L
[[1]]
[1] 9 1 4 7 0
[[2]]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

> names(L) <- c(“Num”, “Mat”)

>L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> L <- list(Num = c(9,1, 4, 7, 0), Mat = matrix(c(1,2,3,4,5,6), nrow = 3))
>L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

Lists can be nested; that is, a list can be an element of another list. Vectors,
arrays and matrices, however, are not recursive/nested; they are atomic. The functions
is.recursive() and is.atomic() show whether a variable is recursive or atomic respectively.
> is.atomic(list())
[1] FALSE
> is.recursive(list())
[1] TRUE
> is.atomic(L)
[1] FALSE
> is.recursive(L)
[1] TRUE
> is.atomic(matrix())
[1] TRUE
> is.recursive(matrix())
[1] FALSE

The length() function works on lists as it does on vectors and matrices. But the dim(),
nrow() and ncol() functions return only NULL.
> length(L)
[1] 2
> dim(L)
NULL
> nrow(L)
NULL
> ncol(L)
NULL

Arithmetic cannot be applied to a list as a whole; it has to be applied to the
individual elements, and only where their data types allow it. As with vectors, the
elements of a list can be accessed by indexing with square brackets. The index
can be a positive number, a negative number, an element name or a logical value.
> L1 <- list(l1 = c(8, 9, 1), l2 = matrix(c(1,2,3,4), nrow = 2),
l3 = list( l31 = c(“a”, “b”), l32 = c(TRUE, FALSE) ))
> L1
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
$l3
$l3$l31
[1] “a” “b”
$l3$l32
[1] TRUE FALSE

> L1[1:2]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

> L1[-3]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

> L1[c(“l1”, “l2”)]


$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

> L1[c(TRUE, TRUE, FALSE)]


$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

A list is a generic vector containing other objects.


> a = c(4,8,12)
> b = c(“abc”, “def ”, “ghi”, “jkl”, “mno”)
> d = c(TRUE, FALSE)
> t = list(a, b, d, 5)

The list t contains copies of the vectors a, b and d. A list slice is retrieved using
single square brackets []. Below, t[2] is a slice of the list containing a copy of b.
A slice can also be retrieved with multiple members.
> t[2]
[[1]]
[1] “abc” “def ” “ghi” “jkl” “mno”
> t[c(2,4)]
[[1]]
[1] “abc” “def ” “ghi” “jkl” “mno”
[[2]]
[1] 5

To reference a list member directly, double square brackets [[]] are used. Thus
t[[2]] retrieves the second member of the list t. The result is a copy of b itself,
not a list slice containing it. The contents of the member can also be modified
directly, and the original vector b is unaffected.
> t[[2]]
[1] “abc” “def ” “ghi” “jkl” “mno”
> t[[2]][1] = “qqq”
> t[[2]]
[1] “qqq” “def ” “ghi” “jkl” “mno”
>b
[1] “abc” “def ” “ghi” “jkl” “mno”

We can assign names to the list members and reference lists by names instead of
numeric indexes. A list of two members is given as example below with the member
names as “first” and “second”. The list slice containing the member “first” can be
retrieved using the square brackets [] as shown below.
> l = list(first=c(1,2,3), second=c(“a”,”b”, “c”))
>l
$first
[1] 1 2 3
$second
[1] “a” “b” “c”
> l[“first”]
$first
[1] 1 2 3

The named list member can also be directly referenced with the $ operator or
double square brackets [[]] as below.
> l$first
[1] 1 2 3
> l[[“first”]]
[1] 1 2 3
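
The difference between a list slice and a list member can be checked with class(). This is a small sketch using the list l defined above; the names slice and member are illustrative.

```r
l <- list(first = c(1, 2, 3), second = c("a", "b", "c"))

slice  <- l["first"]     # single brackets: a list of length 1
member <- l[["first"]]   # double brackets: the numeric vector itself

class(slice)    # "list"
class(member)   # "numeric"

total <- sum(member)   # 6; sum(slice) would fail because slice is still a list
```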

A vector can be converted to a list using the function as.list(). Similarly, a


list can be converted into a vector, provided the list contains scalar elements of
the same type. This is done using the conversion functions such as as.numeric(),
as.character() and so on. If a list consists of non-scalar elements, but if they are of
the same type, then it can be converted into a vector using the function unlist().
> v <- c(7, 3, 9, 2, 6)

> as.list(v)
[[1]]
[1] 7
[[2]]
[1] 3
[[3]]
[1] 9
[[4]]
[1] 2
[[5]]
[1] 6

> L <- list(3, 7, 8, 12, 14)


> as.numeric(L)
[1] 3 7 8 12 14
> L1 <- list(“aaa”, “bbb”, “ccc”)
> L1
[[1]]
[1] “aaa”
[[2]]
[1] “bbb”
[[3]]
[1] “ccc”

> as.character(L1)
[1] “aaa” “bbb” “ccc”
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L1
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55

> unlist(L1)
l11 l12 l13 l21 l22 l23 l24 l25
78 90 21 11 22 33 44 55

The c() function can also be used to combine lists as we do for vectors.
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L2 <- list(“aaa”, “bbb”, “ccc”)
> c(L1, L2)
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55
[[3]]
[1] “aaa”
[[4]]
[1] “bbb”
[[5]]
[1] “ccc”

2.5. Data Frames


A data frame is used for storing data tables, that is, spreadsheet-like data. It is a
list of vectors of equal length (not necessarily of the same basic data type). Consider
a data frame df1 consisting of three vectors a, b, and d.
> a = c(1, 2, 3)
> b = c(“a”, “b”, “c”)
> d = c(TRUE, FALSE, TRUE)
> df1 = data.frame(a, b, d)
> df1
a b d
1 1 a TRUE
2 2 b FALSE
3 3 c TRUE

By default the row names are automatically numbered from 1 to the number of
rows in the data frame. It is also possible to provide row names manually using the
row.names argument as below.
> df1 = data.frame(a, b, d, row.names = c(“one”, “two”, “three”))
> df1
a b d
one 1 a TRUE
two 2 b FALSE
three 3 c TRUE

The functions rownames(), colnames(), dimnames(), nrow(), ncol() and dim()
can be applied to data frames as below. The length() and names() functions
return the same results as ncol() and colnames() respectively.
> rownames(df1)
[1] “one” “two” “three”
> colnames(df1)
[1] “a” “b” “d”


> dimnames(df1)
[[1]]
[1] “one” “two” “three”
[[2]]
[1] “a” “b” “d”

> nrow(df1)
[1] 3
> ncol(df1)
[1] 3
> dim(df1)
[1] 3 3
> length(df1)
[1] 3
> names(df1)
[1] “a” “b” “d”

It is possible to create a data frame from vectors of different lengths, as long as the
shorter ones can be recycled to match the longest one exactly. Otherwise, an error
is thrown.
> df2 <- data.frame(x = 1, y = 2:3, y = 4:7)
> df2
x y y.1
1 1 2 4
2 1 3 5
3 1 2 6
4 1 3 7

By default, data.frame() converts column names into syntactically valid names, as
below. The argument check.names can be set to FALSE to keep the names exactly as given.
> df3 <- data.frame("BaD col" = c(1:5), "!@#$%^&*" = c("aaa"))


> df3
BaD.col X........
1 1 aaa
2 2 aaa
3 3 aaa
4 4 aaa
5 5 aaa

There are many built-in data frames available in R (for example, mtcars). When
the name of this data frame is typed at the R prompt, it produces the result below.
> mtcars
mpg  cyl  disp   hp  drat    wt ...
Mazda RX4      21.0    6   160  110  3.90  2.62 ...
Mazda RX4 Wag  21.0    6   160  110  3.90  2.88 ...
Datsun 710     22.8    4   108   93  3.85  2.32 ...
............

The top line contains the header or the column names. Each row denotes a record
or a row in the table. A row begins with the name of the row. Each data member of a
row is called a cell. To retrieve a cell value, we enter the row and the column number
of the cell in square brackets [] separated by a comma. The cell value of the second
row and third column is retrieved as below. The row and the column names can also
be used inside the square brackets [] instead of the row and column numbers.
> mtcars[2, 3]
[1] 160
> mtcars[“Mazda RX4 Wag”, “disp”]
[1] 160

The nrow() function gives the number of rows in a data frame and the ncol()
function gives the number of columns in a data frame. To get the preview or the first
few records of a data frame along with the header the head() function can be used.
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11
> head(mtcars)
mpg  cyl  disp   hp  drat    wt ...
Mazda RX4      21.0    6   160  110  3.90  2.62 ...
......

To retrieve a column from a data frame we use double square brackets [[]] and
the column name or the column number inside the [[]]. The same can be achieved
by making use of the $ symbol as well. This same result can also be achieved by
using single brackets [] by mentioning a comma instead of the row name / number
and using the column name / number as the second index inside the [].
> mtcars[[“hp”]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars$hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[,”hp”]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[,4]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....

Similarly, if we use the column name or the column number inside a single
square bracket [], we get the below result.
> mtcars[4]
hp
Mazda RX4 110
Mazda RX4 Wag 110


Datsun 710 93
....
> mtcars[c(“mpg”,”hp”)]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
....

To retrieve a row from a data frame we use the single square brackets [] only by
mentioning the row name / number as the first index inside [] and a comma instead
of the column name / number.
> mtcars[6,]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....

> mtcars[c(6,18),]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
Fiat 128 32.4 4 78.7 66 4.08 2.20....
> mtcars[“Valiant”,]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....

> mtcars[c(“Valiant”,”Fiat 128”),]


mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
Fiat 128 32.4 4 78.7 66 4.08 2.20....

If we need to fetch a subset of a data frame by selecting a few columns and
specifying conditions on the rows, we can use the subset() function. This
function takes three arguments: the data frame, the condition to be applied on the
rows and the columns to be fetched.
> x <- c(“a”, “b”, “c”, “d”, “e”, “f ”)
> y <- c(3, 4, 7, 8, 12, 15)
> z <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
> D <- data.frame(x, y, z)
>D

x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE

> subset(D, y<10 & z, x)


x
1 a
2 b
4 d
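
The same selection can be sketched with ordinary square-bracket indexing, which the earlier examples on rows and columns already use. The data frame below is rebuilt from the values shown above; drop = FALSE is a standard argument that keeps the one-column result as a data frame.

```r
D <- data.frame(x = c("a", "b", "c", "d", "e", "f"),
                y = c(3, 4, 7, 8, 12, 15),
                z = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE))

# Row condition first, column name second
by_index  <- D[D$y < 10 & D$z, "x", drop = FALSE]
by_subset <- subset(D, y < 10 & z, x)

same_rows <- identical(rownames(by_index), rownames(by_subset))
```

subset() is convenient for interactive work because column names can be written without the D$ prefix; explicit indexing is the more transparent choice inside functions.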

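As with matrices, the transpose of a data frame can be obtained using the
t() function as below. The result is a matrix, and because the columns are of
different types every value is converted to a character string.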
> t(D)
[,1] [,2] [,3] [,4] [,5] [,6]
x “a” “b” “c” “d” “e” “f ”
y “ 3” “ 4” “ 7” “ 8” “12” “15”
z “ TRUE” “ TRUE” “FALSE” “ TRUE” “FALSE” “ TRUE”

The functions rbind() and cbind() can also be applied to data frames as we
do for matrices. The only condition for rbind() is that the column names should
match; cbind() does not check the column names, even for duplicates.
> x1 <- c(“aaa”, “bbb”, “ccc”, “ddd”, “eee”, “fff ”)
> y1 <- c(9, 12, 17, 18, 23, 32)
> z1 <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
> E <- data.frame(x1, y1, z1)
>E
x1 y1 z1
1 aaa 9 TRUE
2 bbb 12 FALSE
3 ccc 17 TRUE
4 ddd 18 FALSE
5 eee 23 TRUE
6 fff 32 FALSE
> cbind(D, E)

x y z x1 y1 z1
1 a 3 TRUE aaa 9 TRUE
2 b 4 TRUE bbb 12 FALSE
3 c 7 FALSE ccc 17 TRUE
4 d 8 TRUE ddd 18 FALSE
5 e 12 FALSE eee 23 TRUE
6 f 15 TRUE fff 32 FALSE
> F <- data.frame(x, y = y1, z = z1)
> F
x y z
1 a 9 TRUE
2 b 12 FALSE
3 c 17 TRUE
4 d 18 FALSE
5 e 23 TRUE
6 f 32 FALSE
> rbind(D, F)
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE
7 a 9 TRUE
8 b 12 FALSE
9 c 17 TRUE
10 d 18 FALSE
11 e 23 TRUE
12 f 32 FALSE

The merge() function merges two data frames that have common column names.
By default, merge() matches rows on all the common columns; otherwise the
column or columns to match on are named in the by argument. Setting all = TRUE
keeps the rows of both data frames even when they have no match.
> merge(D, F, by = “x”, all = TRUE)
x y.x z.x y.y z.y


1 a 3 TRUE 9 TRUE
2 b 4 TRUE 12 FALSE
3 c 7 FALSE 17 TRUE
4 d 8 TRUE 18 FALSE
5 e 12 FALSE 23 TRUE
6 f 15 TRUE 32 FALSE
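
In the example above every value of x appears in both data frames, so the effect of all = TRUE is not visible. The following sketch uses two made-up data frames (left and right, keyed on a hypothetical id column) whose keys only partly overlap.

```r
# Hypothetical data frames sharing the key column "id"
left  <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
right <- data.frame(id = c(2, 3, 4), grade = c("B", "C", "A"))

inner <- merge(left, right, by = "id")               # ids 2 and 3 only
outer <- merge(left, right, by = "id", all = TRUE)   # ids 1 to 4, NA where unmatched

nrow(inner)   # 2
nrow(outer)   # 4
```

The arguments all.x = TRUE and all.y = TRUE keep the unmatched rows of only the first or only the second data frame.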

The functions colSums(), colMeans(), rowSums() and rowMeans() can be


applied on the data frames that have numeric values as below.
> x <- c(5, 6, 7, 8)
> y <- c(15, 16, 17, 18)
> z <- c(25, 26, 27, 28)
> G <- data.frame(x, y, z)
>G
x y z
1 5 15 25
2 6 16 26
3 7 17 27
4 8 18 28

> colSums(G[, 1:2])


x y
26 66

> colMeans(G[, 1:3])


x y z
6.5 16.5 26.5

> rowSums(G[1:3, ])
1 2 3
45 48 51

> rowMeans(G[2:4, ])
2 3 4
16 17 18

2.6. Factors
Factors are used to store categorical data like gender (“Male” or “Female”). They
behave sometimes like character vectors and sometimes like integer vectors, depending
on the context.

Consider a data frame that stores the weights of a few males and females. The
column that stores the gender is a factor, as it holds categorical data. (From R 4.0.0
onwards, data.frame() keeps character columns as they are unless
stringsAsFactors = TRUE is given, as below.) The choices “female” and “male” are
called the levels of the factor. They can be viewed using the levels() function and
counted using the nlevels() function.
> weight <- data.frame(wt_kg = c(60,82,45, 49,52,75,68),
gender = c("female","male", "female", "female", "female", "male", "male"),
stringsAsFactors = TRUE)
> weight
wt_kg gender
1 60 female
2 82 male
3 45 female
4 49 female
5 52 female
6 75 male
7 68 male

> weight$gender
[1] female male female female female male male


Levels: female male
> levels(weight$gender)
[1] “female” “male”
> nlevels(weight$gender)
[1] 2

At the atomic level a factor can be created using the factor() function, which
takes a character vector as the argument.
> gender <- factor(c(“female”, “male”, “female”, “female”, “female”, “male”, “male”))
> gender
[1] female male female female female male male
Levels: female male
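
The remark that factors sometimes behave like integer vectors can be illustrated as follows. This is a minimal sketch; the factor f with number-like labels is a made-up case showing a common pitfall.

```r
gender <- factor(c("female", "male", "female"))
codes  <- as.integer(gender)     # internal codes: 1 2 1
labels <- as.character(gender)   # "female" "male" "female"

# Pitfall: a factor whose labels look like numbers
f <- factor(c("10", "20", "10"))
wrong <- as.numeric(f)                 # the codes 1 2 1, not the numbers
right <- as.numeric(as.character(f))   # 10 20 10
```

Each value is stored as the position of its level, which is why converting through as.character() is needed to recover the original numbers.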

The levels argument can be used in the factor() function to specify the levels of
the factor. It is also possible to rename the levels once the factor is created, using
the function levels(); the new names are matched to the existing levels by position.
The function relevel() changes which level comes first.
> gender <- factor(c("female", "male", "female", "female", "female",
"male", "male"), levels = c("male", "female"))
> gender
[1] female male female female female male male
Levels: male female
> levels(gender) <- c("M", "F")
> gender
[1] F M F F F M M
Levels: M F
> relevel(gender, "F")
[1] F M F F F M M
Levels: F M

It is possible to drop a level from a factor using the function droplevels() when
the level is no longer in use, as in the example below. [Note: the function is.na()
is used to remove the row with the missing value.]
> diet <- data.frame(eat = c("fruit", "fruit", "vegetable", "fruit"),
type = c("apple", "mango", NA, "papaya"), stringsAsFactors = TRUE)
> diet
eat type
1 fruit apple
2 fruit mango
3 vegetable <NA>
4 fruit papaya
> diet <- subset(diet, !is.na(type))
> diet
eat type
1 fruit apple
2 fruit mango
4 fruit papaya
> diet$eat
[1] fruit fruit fruit
Levels: fruit vegetable
> levels(diet)
NULL
> levels(diet$eat)
[1] “fruit” “vegetable”
> unique(diet$eat)
[1] fruit
Levels: fruit vegetable
> diet$eat <- droplevels(diet$eat)


> levels(diet$eat)
[1] “fruit”

In some cases, the levels need to be ordered, as in rating a product or a course. The
ratings can be “Outstanding”, “Excellent”, “Very Good”, “Good”, “Bad”. A factor
created with these levels is not ordered by default. So, to order the
levels in a factor, we can either use the function ordered() or the argument ordered =
TRUE in the factor() function. Such ordering can be useful when analysing survey
data.
> ch <- c(“Outstanding”, “Excellent”, “Very Good”, “Good”, “Bad”)
> val <- sample(ch, 100, replace = TRUE)
> rating <- factor(val, ch)
> rating
[1] Outstanding Bad Outstanding Good Very Good Very Good
[7] Excellent Outstanding Bad Excellent Very Good Bad
...
Levels: Outstanding Excellent Very Good Good Bad
> is.factor(rating)
[1] TRUE
> is.ordered(rating)
[1] FALSE
> rating_ord <- ordered(val, ch)
> is.factor(rating_ord)
[1] TRUE
> is.ordered(rating_ord)
[1] TRUE
> rating_ord
[1] Outstanding Bad Outstanding Good Very Good Very Good


[7] Excellent Outstanding Bad Excellent Very Good Bad
...
Levels: Outstanding < Excellent < Very Good < Good < Bad

Numeric values can be summarized into factors using the cut() function and
the result can be viewed using the table() function which lists the count of numbers
in each category. For example let us consider the variable age which has the numeric
values of ages. These ages can be grouped using the cut() function with an interval
of 10 and the result is a factor age_group.
> age <- c(18,20, 31, 32, 33, 35, 41, 38, 45, 48, 51, 27, 29, 42, 39)
> age_group <- cut(age, seq.int(15, 55, 10))
> age
[1] 18 20 31 32 33 35 41 38 45 48 51 27 29 42 39
> age_group
[1] (15,25] (15,25] (25,35] (25,35] (25,35] (25,35] (35,45] (35,45] (35,45] (45,55]
[11] (45,55] (25,35] (25,35] (35,45] (35,45]
Levels: (15,25] (25,35] (35,45] (45,55]
> table(age_group)
age_group
(15,25] (25,35] (35,45] (45,55]
2 6 5 2
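
By default cut() makes intervals that are open on the left and closed on the right, so the age 35 falls in (25,35]. The sketch below uses the standard right and labels arguments of cut() to close the intervals on the left instead and to give them readable names; the label texts are made up for illustration.

```r
age <- c(18, 20, 31, 32, 33, 35, 41, 38, 45, 48, 51, 27, 29, 42, 39)

# right = FALSE makes the intervals [15,25), [25,35), [35,45), [45,55)
age_band <- cut(age, breaks = seq.int(15, 55, 10), right = FALSE,
                labels = c("15-24", "25-34", "35-44", "45-54"))

counts <- table(age_band)
counts
```

With this choice the ages 35 and 45 move into the next band, so the counts become 2, 5, 5 and 3.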

The function gl() can be used to create a factor, which takes the first argument
that tells how many levels the factor contains and the second argument that tells
how many times each level has to be repeated as value. This function can also take
the argument labels, which lists the names of the factor levels. The function can
also be made to list alternating values of the labels as below.
> gl(5,3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
> gl(5,3, labels = c("one", "two", "three", "four", "five"))
[1] one one one two two two three three three four four four five
[14] five five
Levels: one two three four five
> gl(5,1,15)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Levels: 1 2 3 4 5

The factors thus generated can be combined using the function interaction() to
get a resultant combined factor.
> fac1 <- gl(5,3, labels = c(“one”, “two”, “three”, “four”, “five”))
> fac2 <- gl(5,1,15, labels = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n", "o"))
> interaction(fac1, fac2)
[1] one.a one.b one.c two.d two.e two.a three.b three.c three.d four.e
[11] four.a four.b five.c five.d five.e
75 Levels: one.a two.a three.a four.a five.a one.b two.b three.b four.b ... five.o

2.7. Strings
Strings are stored in character vectors, and most string manipulation functions act
on character vectors. Character vectors can be created using the c() function by
enclosing each string in double or single quotes (double quotes are the usual
convention). The paste() function concatenates strings with a space in between. If
no space is wanted, we use the function paste0(). To use some other separator
between the concatenated strings, we pass the argument sep to the paste()
function. The resulting vector can be collapsed into a single string using the
collapse argument.
> c(“String 1”, ‘String 2’)
[1] “String 1” “String 2”
> paste(c(“Pine”, “Red”), “Apple”)
[1] "Pine Apple" "Red Apple"
> paste0(c(“Pine”, “Red”), “Apple”)
[1] “PineApple” “RedApple”
> paste(c(“Pine”, “Red”), “Apple”, sep = “-”)
[1] “Pine-Apple” “Red-Apple”
> paste(c(“Pine”, “Red”), “Apple”, sep = “-”, collapse = “, “)
[1] “Pine-Apple, Red-Apple”

The toString() function converts a numeric vector into a single character string,
with the elements separated by a comma and a space. An optional width argument
limits the length of the resulting string; a longer result is truncated and ends in
"....", as in the last call below.
> x <- c(1:10)^3
> x
[1] 1 8 27 64 125 216 343 512 729 1000
> toString(x)
[1] “1, 8, 27, 64, 125, 216, 343, 512, 729, 1000”
> toString(x, 18)
[1] “1, 8, 27, 64, ....”

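The cat() function also concatenates its arguments with a space in between, but
unlike paste() it writes the result directly to the console, without quotes or an
index, and returns NULL rather than a character vector.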
> cat(c(“Red”, “Pine”), “Apple”)
Red Pine Apple
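The difference between the two functions can be checked directly. This is a small sketch; the variable names p and r are chosen only for illustration.

```r
# paste() returns a character vector that can be stored and reused
p <- paste(c("Red", "Pine"), "Apple")
print(length(p))     # one element per element of the first argument

# cat() writes its arguments to the console and returns NULL
r <- cat(c("Red", "Pine"), "Apple", "\n")
print(is.null(r))
```

For this reason paste() is the usual choice for building strings, and cat() for displaying them.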

The noquote() function forces the string outputs not to be displayed with
quotes.
> a <- c(“I”, “am”, “a”, “data”, “scientist”)
> a
[1] “I” “am” “a” “data” “scientist”
> noquote(a)
[1] I am a data scientist

The formatC() function formats numbers and returns them as strings. Its
arguments include digits, width, format and flag, which can be used as below. The
function format() is a close relative of formatC(); its usage is also shown below.
> h <- c(4.567, 8.981, 27.772)
> h
[1] 4.567 8.981 27.772
> formatC(h)
[1] “4.567” “8.981” “27.77”
> formatC(h, digits = 3)
[1] “4.57” “8.98” “27.8”
> formatC(h, digits = 3, width = 5)
[1] “ 4.57” “ 8.98” “ 27.8”
> formatC(h, digits = 3, format = “e”)
[1] “4.567e+00” “8.981e+00” “2.777e+01”
> formatC(h, digits = 3, flag = “+”)
[1] “+4.57” “+8.98” “+27.8”

> format(h)
[1] “ 4.567” “ 8.981” “27.772”
> format(h, digits = 3)
[1] “ 4.57” “ 8.98” “27.77”
> format(h, digits = 3, trim = TRUE)
[1] “4.57” “8.98” “27.77”

The sprintf() function is also used for formatting strings and for inserting
values into a string. In its format string, the placeholder %s stands for a string,
while %d and %f stand for an integer and a floating-point number respectively. The
function is vectorised, so each element of the supplied vectors produces one
output string, as in the example below.
> x <- c(1, 2, 3)
> sprintf("The number %d in the list is = %f", x, h)
[1] “The number 1 in the list is = 4.567000”
[2] “The number 2 in the list is = 8.981000”
[3] “The number 3 in the list is = 27.772000”
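The placeholders of sprintf() can also carry a width and a precision. The sketch below reuses the vector h from the text; the label "ID" and the number 7 are made-up values for illustration.

```r
h <- c(4.567, 8.981, 27.772)

# %.2f keeps two decimal places; %8.2f also pads the result to a width of 8
two_dp <- sprintf("%.2f", h)
padded <- sprintf("%8.2f", h)

# %5s right-aligns a string in a field of width 5; %03d pads an integer with zeros
label <- sprintf("%5s-%03d", "ID", 7L)

print(two_dp)
print(padded)
print(label)
```

Fixed widths of this kind are convenient when numbers have to line up in a printed report.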

To print a tab within the text, we can use the cat() function with the special
character "\t" included in the text as below. Similarly, to insert a new line within
the text, we use "\n". In the cat() function, the argument fill = TRUE makes the
output end with a new line, so that the cursor is placed on the next line. A
backslash that is to appear in the text must be preceded by another backslash. A
double quote inside text enclosed in double quotes, or a single quote inside text
enclosed in single quotes, must likewise be preceded by a backslash. A single
quote inside double quotes, or a double quote inside single quotes, needs no
backslash.
> cat(“Black\tBerry”, fill = TRUE)
Black Berry
> cat(“Black\nBerry”, fill = TRUE)
Black
Berry
> cat(“Black\\Berry”, fill = TRUE)
Black\Berry
> cat(“Black\”Berry”, fill = TRUE)
Black”Berry
> cat(‘Black\’Berry’, fill = TRUE)
Black’Berry
> cat(‘Black”Berry’, fill = TRUE)
Black”Berry
> cat("Black'Berry", fill = TRUE)
Black'Berry

The functions toupper() and tolower() convert a string into upper case or lower
case respectively. The substring() and substr() functions extract a part of the
given text. Their arguments are the text, the starting position and the ending
position. With these arguments both functions produce the same result.
> toupper(“The cat is on the Wall”)
[1] “THE CAT IS ON THE WALL”
> tolower(“The cat is on the Wall”)
[1] “the cat is on the wall”

> substring("The cat is on the wall", 3, 10)
[1] "e cat is"
> substr(“The cat is on the wall”, 3, 10)
[1] “e cat is”
> substr(“The cat is on the wall”, 5, 10)
[1] “cat is”

The function strsplit() does the splitting of a text into many strings based on
the splitting character mentioned as argument. In the below example the splitting
is done when a space is encountered. It is important to note that this function
returns a list and not a character vector as a result.
> strsplit(“I like Bannana, Orange and Pineapple”, “ “)
[[1]]
[1] “I” “like” “Bannana,” “Orange” “and” “Pineapple”

In the same example, if the text has to be split at a comma followed by a space or
at a space alone, the pattern is given as the regular expression ",? ". Here the
question mark makes the comma optional, while the space is mandatory.
> strsplit("I like Bannana, Orange and Pineapple", ",? ")
[[1]]
[1] “I” “like” “Bannana” “Orange” “and” “Pineapple”
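Since strsplit() returns a list, the pieces have to be extracted from it before they can be used as an ordinary character vector. A minimal sketch, in which the second sentence is a hypothetical addition:

```r
fruit_text <- c("I like Banana, Orange and Pineapple",
                "Mango and Grapes")   # hypothetical second sentence

parts <- strsplit(fruit_text, ",? ")

# One list element per input string
print(length(parts))

# The words of the first sentence as a character vector
first_words <- parts[[1]]
print(first_words)

# Number of words in each sentence
word_counts <- lengths(parts)
print(word_counts)
```

The list form is what allows each input string to yield a different number of pieces.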

R's current working directory can be obtained using the function getwd() and it
can be changed using the function setwd(). On Windows, the directory path passed
to setwd() should use forward slashes (or doubled backslashes) rather than single
backslashes, as in the example below.
> getwd()
[1] “C:/Users/admin/Documents”
> setwd(“C:/Program Files/R”)
> getwd()
[1] “C:/Program Files/R”

It is also possible to construct file paths using the file.path() function, which
automatically inserts the forward slash between the directory names. The function
R.home() returns the home directory where R is installed.
> file.path(“C:”, “Program Files”, “R”, “R-3.3.0”)
[1] “C:/Program Files/R/R-3.3.0”
> R.home()
[1] “C:/PROGRA~1/R/R-33~1.0”

Paths can also be specified in relative terms: "." denotes the current directory,
".." denotes the parent directory and "~" denotes the home directory. The function
path.expand() replaces a leading "~" with the path of the home directory and
leaves other paths, such as "." and "..", unchanged.
> path.expand(“.”)
[1] “.”
> path.expand(“..”)
[1] “..”
> path.expand(“~”)
[1] “C:/Users/admin/Documents”
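When a full absolute path is needed for "." or "..", the base function normalizePath() can be used. This is offered as a supplementary sketch; normalizePath() is not discussed in the text above.

```r
# path.expand() leaves "." untouched because there is no leading tilde
print(path.expand("."))

# normalizePath() resolves "." against the current working directory
abs_path <- normalizePath(".", winslash = "/")
print(abs_path)
```

The argument winslash = "/" only affects Windows, where it makes the result use forward slashes.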

The function basename() returns only the file name leaving its directory if
specified. On the other hand the function dirname() returns only the directory
name leaving the file name.
> filename <- “C:/Program Files/R/R-3.3.0/bin/R.exe”
> basename(filename)
[1] “R.exe”
> dirname(filename)
[1] “C:/Program Files/R/R-3.3.0/bin”

2.8. Dates and Times


Dates and Times are common in data analysis and R has a wide range of capabilities
for dealing with dates and times.

2.8.1. Date and Time Classes

R has three date and time base classes: POSIXct, POSIXlt and Date. POSIX is a
set of standards that defines, among other things, how dates and times should be
specified. In POSIXct, "ct" stands for "calendar time"; it stores a date and time as
the number of seconds since the start of 1970. In POSIXlt, "lt" stands for "local
time"; it stores a date and time as a list of seconds, minutes, hours, day of month
etc. For storing and calculating with dates, we can use POSIXct and for extracting
parts of dates, we can use POSIXlt.

The function Sys.time() is used to return the current date and time. This
returned value is by default in the POSIXct form. But, this can be converted to
POSIXlt form using the function as.POSIXlt(). When printed both forms of date
and time are displayed in the same manner, but their internal storage mechanism
differs. We can also access individual components of a POSIXlt date using the dollar
symbol or the double brackets as shown below.
> Sys.time()
[1] “2017-05-11 14:31:29 IST”
> t <- Sys.time()
> t1 <- Sys.time()
> t2 <- as.POSIXlt(t1)
> t1
[1] “2017-05-11 14:39:39 IST”
> t2
[1] “2017-05-11 14:39:39 IST”
> class(t1)
[1] “POSIXct” “POSIXt”
> class(t2)
[1] “POSIXlt” “POSIXt”
> t2$sec
[1] 39.20794
> t2[[“min”]]
[1] 39
> t2$hour
[1] 14
> t2$mday
[1] 11
> t2$wday
[1] 4
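The different storage of the two classes can be made visible by converting them to plain values. The sketch below uses a fixed instant in UTC, an assumption made here so that the result does not depend on the local time zone.

```r
# A fixed instant, parsed in UTC
ct <- as.POSIXct("2017-05-11 14:39:39", tz = "UTC")
lt <- as.POSIXlt(ct)

# POSIXct is a single number: seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(ct)
print(secs)

# POSIXlt is a list of components; year counts from 1900 and mon from 0
yr <- lt$year + 1900
mo <- lt$mon + 1
print(c(yr, mo))
```

The offsets in year and mon are a common source of mistakes when the components of a POSIXlt value are used directly.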

The Date class stores the dates as number of days from start of 1970. This class
is useful when time is insignificant. The as.Date() function can be used to convert
a date in other class formats to the Date class format.
> t3 <- as.Date(t2)
> t3
[1] “2017-05-11”
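Because a Date is stored as a day count, arithmetic on dates works in whole days. A short sketch with two illustrative dates:

```r
d1 <- as.Date("2017-05-11")
d2 <- as.Date("2017-12-25")

# The underlying value is the number of days since 1970-01-01
day_number <- as.numeric(d1)
print(day_number)

# Subtracting two dates gives a time difference in days
gap <- d2 - d1
print(gap)

# Adding a number shifts the date by that many days
later <- d1 + 30
print(later)
```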

There are also other date and time classes, provided by add-on packages in R,
such as date, dates, chron, yearmon, yearqtr, timeDate, ti and jul.

2.8.2. Date Conversions

In CSV files the dates are normally stored as strings and they have to be converted
into a date and time class. For this we parse the strings using the function
strptime(), which returns a date of the class POSIXlt. The date format is specified
as a string and passed as an argument to the strptime() function. If the given
string does not match the format given in the format string, the function returns
NA, as in the second call below, where the format expects hyphens in the date but
the string uses slashes.
> date1 <- strptime(“22:15:45 22/08/2015”, “%H:%M:%S %d/%m/%Y”)
> date1
[1] “2015-08-22 22:15:45 IST”

> date2 <- strptime("22:15:45 22/08/2015", "%H:%M:%S %d-%m-%Y")
> date2
[1] NA
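The NA result is useful for locating badly formatted entries in a whole column of date strings. A minimal sketch, assuming a hypothetical vector raw in which the last entry uses a different layout:

```r
raw <- c("22/08/2015", "01/09/2015", "2015-09-15")   # last entry has another layout

# as.Date() accepts the same format codes as strptime()
parsed <- as.Date(raw, format = "%d/%m/%Y")
print(parsed)

# Entries that do not match the format come back as NA and can be located
bad <- which(is.na(parsed))
print(bad)
```

The positions in bad can then be inspected or parsed again with a second format string.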

In the format string “%H” denotes hour in 24 hour system, “%M” denotes
minutes, “%S” denotes second, “%m” denotes the number of the month, “%d”
denotes the day of the month as number, “%Y” denotes four digit year.

To convert a date into a string the function strftime() is used. This function also
takes a date formatting string as argument like strptime(). In the format string “%I”
denotes hour in 12 hours system, “%p” denotes AM/PM, “%A” denotes the string of
day of the week, “%B” denotes the string of name of the month.
> strftime(Sys.Date(), "It's %I:%M%p on %A %d %B, %Y.")
[1] “It’s 12:00AM on Thursday 11 May, 2017.”

2.8.3. Time Zones

It is possible to specify the time zone through the tz argument when parsing a
date string using strptime() or formatting a date using strftime(). If this is not
specified, the default time zone is taken. The function Sys.timezone() returns the
default time zone of the system, and Sys.getlocale("LC_TIME") returns the locale
that the operating system uses for formatting dates and times, such as the language
of day and month names.
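The effect of the tz argument can be sketched by displaying one instant in two zones. The zone name "Asia/Kolkata" is an assumption about the time zone database installed on the system; the instant itself is taken from the parsing example.

```r
# One instant, defined in UTC
instant <- as.POSIXct("2015-08-22 22:15:45", tz = "UTC")

# The same instant displayed in two zones; only the printed clock time differs
utc_text   <- format(instant, "%Y-%m-%d %H:%M", tz = "UTC")
india_text <- format(instant, "%Y-%m-%d %H:%M", tz = "Asia/Kolkata")

print(utc_text)
print(india_text)
```

Indian Standard Time is five and a half hours ahead of UTC, so the second string falls on the following calendar day.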

• To create an integer variable and force a numeric value into an integer in
R, the as.integer() function is invoked.
• The paste() function can be used to concatenate two character values.
• The substr() function can be used to extract a substring from a given
string.
• The sub() function is used to replace the first occurrence of a string with
another string.
• The vector() function creates a vector of a specified type and length.
• The seq() function generates sequences.
• The function length() can be used to find the length of the vector.
• The which() function returns the positions of the elements of the vector
that satisfy the condition specified within this function.
• To create an array the array() function can be used.
• A matrix can also be created using the array() function where the
dimension of the array is two.
• The columns of two matrices can be combined using the cbind() function.
• The rows of two matrices can be combined using the rbind() function.
• Lists can be created using the list() function.
• To get the preview or the first few records of a data frame along with the
header the head() function can be used.
• The merge() function can be applied to merge two data frames provided
they have common column names.
• Factors are used to store categorical data and they behave like strings
sometimes and integers sometimes.
• At the atomic level a factor can be created using the factor() function.
• The function gl() can be used to create a factor.
• The factors can be combined using the function interaction().
• R's current working directory can be obtained using the function
getwd().
• The working directory can be changed using the function setwd().
• The function R.home() returns the home directory where R is installed.
• R has three date and time base classes and they are POSIXct, POSIXlt and
Date.
• The function Sys.time() is used to return the current date and time.
• The as.Date() function can be used to convert a date in other class formats
to the Date class format.
• The function Sys.timezone() is used to get the default time zone of the
system.
• The lubridate package has the functions dyears(), dweeks(), ddays(),
dhours(), dminutes(), dseconds() etc. that specify the duration of year,
week, day, hour, minute and second in terms of seconds.
• The lubridate package has the functions years(), weeks(), days(), hours(),
minutes(), seconds() etc. that specify the period of year, week, day, hour,
minute and second in terms of clock time.
CHAPTER 3

Data Preparation

OBJECTIVES

On completion of this Chapter you will be able to:

• know about the default datasets available in R
• know how to import and export CSV files in R
• know how to import unstructured data files into R
• know how to import XML and HTML files into R
• know how to import JSON and YAML files into R
• know how to import and export Excel files in R
• know how to import SAS, SPSS and MATLAB files into R
• know how to import web data files into R
• understand the concept of accessing various databases from R
• manipulate string data
• manipulate data frames
• understand how to melt and cast data in data frames
• understand how the grouping functions are applied on the data in R

3.1. Datasets
R has many datasets built in. R can read data from variety of other data sources
and in variety of formats. One of the packages in R is datasets which is filled with
example datasets. Many other packages also contain datasets. We can see all the
datasets available in the loaded packages using the data() function.

To access a particular dataset use the data() function with its argument as the
dataset name enclosed within double quotes and the second optional argument
being the package name in which the dataset is present (This second argument is
required only if the particular package is not loaded). The invoked dataset can be
listed just like a data frame using the head() function.
> data(“kidney”, package = “survival”)
> head(kidney)
id time status age sex disease frail
1 1 8 1 28 1 Other 2.3
2 1 16 1 28 1 Other 2.3
3 2 23 1 48 2 GN 1.9
….

Figure 3.1 R-Studio Showing the List of Datasets



3.2. Importing and Exporting Files

3.2.1. Text and CSV Files

Text documents have several formats. Common formats are CSV (Comma Separated
Values), XML (Extensible Markup Language), JSON (JavaScript Object Notation)
and YAML. An example of unstructured text data is a book.

A Comma Separated Values (CSV) file holds spreadsheet-like data stored as
comma-delimited values. The read.table() function reads delimited text files and
stores the result in a data frame. If the data has a header, it is required to pass the
argument header = TRUE to the read.table() function. The argument fill = TRUE
makes the read.table() function substitute NA values for the missing fields. The
system.file() function is used to locate files that are inside a package. In the below
example "extdata" is the folder name, the package name is "learningr" and the file
name is "RedDeerEndocranialVolume.dlm". The str() function takes the data frame
name as the argument and lists the structure of the dataset stored in the data frame.
> install.packages(“learningr”)
> library(learningr)
> deer_file <- system.file(“extdata”,”RedDeerEndocranialVolume.dlm”,
package = “learningr”)
> deer_data <- read.table(deer_file, header=TRUE, fill=TRUE)
> str(deer_data)
‘data.frame’: 33 obs. of 8 variables:
$ SkullID : Factor w/ 33 levels “A4”,”B11”,”B12”,..: 14 2 17 16 15 13 10 11
19 3 ...
$ VolCT : int 389 389 352 388 375 325 346 302 379 410 ...
$ VolBead : int 375 370 345 370 355 320 335 295 360 400 ...
$ VolLWH : int 1484 1722 1495 1683 1458 1363 1250 1011 1621 1740 ...
$ VolFinarelli: int 337 377 328 377 328 291 289 250 347 387 ...
$ VolCT2 : int NA NA NA NA NA NA 346 303 375 413 ...
$ VolBead2 : int NA NA NA NA NA NA 330 295 365 395 ...
$ VolLWH2 : int NA NA NA NA NA NA 1264 1009 1647 1728 ...

The column names and row names are listed by default and if the row names
are not given in the dataset, the rows are simply numbered 1, 2, 3 and so on. The
arguments specify how the file will be read. The argument sep determines the
character to use as separator between fields. The nrows argument specifies the
number of lines of data to read. The argument skip specifies the number of lines to
skip at the start of the file. The function read.table() by default splits fields on
white space and assumes there is no header row, whereas read.csv() sets the default
separator to a comma and assumes the data has a header row. The function
read.csv2() uses the semicolon as the separator and the comma as the decimal
mark. The read.delim() function imports tab-delimited files with full stops for
decimal places. The read.delim2() function imports tab-delimited files with
commas for decimal places.
> read.csv(deer_file, header=FALSE, skip = 3, nrows = 2)
V1
1 DIC90 352 345 1495 328
2 DIC83 388 370 1683 377
> head(deer_data)
SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
1 DIC44 389 375 1484 337 NA NA NA
2 B11 389 370 1722 377 NA NA NA
3 DIC90 352 345 1495 328 NA NA NA
….

The colbycol and sqldf packages contain functions that allow to read part of
the CSV file into R. These are useful when we don’t need all the columns or all the
rows. For low-level control we can use the scan() function to import CSV file. For
data exported from other languages we may need to pass the na.strings argument to
the read.table() function to replace the missing values. If the data is exported from
SQL, we use na.strings = “NULL” and if the data is exported from SAS or Stata,
we use na.strings = “.”. If the data is exported from Excel we use the na.strings =
c(“”,”#N/A”, “#DIV/0!”, “#NUM!”).
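The effect of na.strings can be sketched with a small CSV supplied inline. The data below is made up for illustration and stands in for a file exported from SQL.

```r
# A small CSV in which a missing value is written as NULL
csv_text <- "id,score
1,10
2,NULL
3,30"

# Without na.strings the score column is read as text
# (character or factor, depending on the R version)
raw <- read.csv(text = csv_text)
print(class(raw$score))

# With na.strings = "NULL" the missing value becomes NA and the column stays numeric
clean <- read.csv(text = csv_text, na.strings = "NULL")
print(clean$score)
```

Keeping the column numeric means that functions such as mean() can be applied directly with na.rm = TRUE.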

Writing data from R into a file is easier than reading files into R. For this we use
the functions write.table() and write.csv(). These functions take a data frame and a
file path as arguments. They also have arguments to specify if we need not include
row names in the output file or to specify the character encoding of the output file.
> write.csv(deer_data,”F:/deer.csv”, row.names = FALSE, fileEncoding = “utf8”)
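A quick way to check an export is to read the file back. The sketch below uses a small made-up data frame in place of deer_data and a temporary file in place of the F: drive path, so that it runs on any system.

```r
# A small data frame standing in for deer_data
df <- data.frame(SkullID = c("DIC44", "B11"), VolCT = c(389L, 389L))

# tempfile() gives a writable path on any platform
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)

# Reading the file back should reproduce the columns and values
back <- read.csv(out_file)
print(back)

unlink(out_file)   # remove the temporary file
```

Without row.names = FALSE the file would gain an extra first column holding the row numbers.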

3.2.2. Unstructured Files

If the file structure is weak, it is easier to read the file as lines of text using the function
readLines() and then parse the contents. The readLines() function accepts a path
to the file as the argument. Similarly, the writeLines() function takes a text line or a
character vector and the file name as argument and writes the text to the file.
> tempest <- readLines(“F:/Tempest.txt”)
> tempest
[1] “The writing of Prefaces to Plays was probably invented by some very”
[2] "ambitious Poet, who never thought he had done enough: Perhaps by some"
[3] "Ape of the French Eloquence, which uses to make a business of a Letter of"
....
> writeLines(“This book is about a story by Shakespeare”, “F:/story.csv”)

3.2.3. XML and HTML Files

XML files are used for storing nested data. A few examples are RSS (Really Simple
Syndication) feeds, SOAP (Simple Object Access Protocol) messages and XHTML
web pages. To read XML files, the XML package has to be installed. When an XML
file is imported, the result can be stored using internal nodes or R-level nodes. If
the result is stored using internal nodes, the node tree can be queried using the
XPath language (used for interrogating XML documents). The XML file can be
imported using the xmlParse() function. This function can take the argument
useInternalNodes = FALSE to use R-level nodes instead of the internal nodes while
importing the XML file. The xmlTreeParse() function uses R-level nodes by default.
> install.packages("XML")
> library(XML)
> xml_file <- system.file("extdata", "options.xml", package = "learningr")
> r_options <- xmlParse(xml_file)
> xmlParse(xml_file, useInternalNodes = FALSE)
> xmlTreeParse(xml_file)

The functions for importing HTML pages are htmlParse() and htmlTreeParse()
and they behave in the same way as the xmlParse() and xmlTreeParse() functions.

3.2.4. JSON and YAML Files

Two packages dealing with JSON data are RJSONIO and rjson, of which RJSONIO
is the one recommended here. Both provide the function fromJSON() to import
JSON data and the function toJSON() to export R objects as JSON. The yaml
package has two functions for importing YAML data and they are yaml.load() and
yaml.load_file(). The function as.yaml() performs the task of converting R objects
to YAML strings.

Many software packages store their data in binary formats which are smaller in size
than text files. They hence provide performance gains at the expense of human
readability.

3.2.5. Excel Files

Excel is one of the world's most widely used data analysis tools and its document
formats are XLS and XLSX. Spreadsheets can be imported with the functions
read.xlsx() and read.xlsx2() of the xlsx package. The colClasses argument
determines what class each column should have in the resulting data frame and this
argument is optional in the above functions. To write to an Excel file from R we use
the function write.xlsx2() that takes the data frame and the file name as arguments.
There is another package, xlsReadWrite, that does the same job as the xlsx package,
but it works only in 32-bit R installations and only on Windows.
> install.packages(“xlsx”)
> library(xlsx)
> logfile <- read.xlsx2("F:/Log2015.xls", sheetIndex = 1, startRow = 2, endRow = 72,
colIndex = 1:5, colClasses = c(“character”, “numeric”, “character”,
“character”, “integer”))

3.2.6. SAS, SPSS and MATLAB Files

The files from a statistical package are imported using the foreign package. The
read.ssd() function is used to read SAS datasets and the read.dta() function is
used to read Stata DTA files. The read.spss() function is used to import the SPSS
data files. Similarly, these files can be written with the write.foreign() function.
The MATLAB binary data files can be read and written using the readMat() and
writeMat() functions in the R.matlab package. The files in picture formats can be
read via the jpeg, png, tiff, rtiff and readbitmap packages.

3.2.7. Web Data

R has ways to import data from web sources using an Application Programming
Interface (API). For example, the World Bank makes its data available through the
WDI package and the Polish government data can be accessed using the
SmarterPoland package. The twitteR package provides access to Twitter's users and
their tweets.

The read.table() function can accept a URL rather than a local file. Accessing a
large file from the internet can be slow, and if the file is required frequently, it is
better to download the file using the download.file() function to create a local copy
and then import that.
> cancer_url <- "http://repository.seasr.org/Datasets/UCI/csv/breast-cancer.csv"
> cancer_data <- read.csv(cancer_url)
> str(cancer_data)
‘data.frame’: 287 obs. of 10 variables:
$ age : Factor w/ 7 levels “20-29”,”30-39”,..: 7 3 4 4 3 3 4 4 3 3 ...
$ menopause : Factor w/ 4 levels “ge40”,”lt40”,..: 4 3 1 1 3 3 3 1 3 3 ...
$ tumor.size : Factor w/ 12 levels "0-4","10-14",..: 12 3 3 7 7 6 5 8 2 1 ...
$ inv.nodes : Factor w/ 8 levels “0-2”,”12-14”,..: 8 1 1 1 1 5 5 1 1 1 ...
$ node.caps : Factor w/ 4 levels “”,”no”,”String”,..: 3 4 2 2 4 4 2 2 2 2 ...
$ deg.malig : Factor w/ 4 levels “1”,”2”,”3”,”String”: 4 3 1 2 3 2 2 3 2 2 ...
$ breast : Factor w/ 3 levels “left”,”right”,..: 3 2 2 1 2 1 2 1 1 2 ...
$ breast.quad: Factor w/ 7 levels “”,”central”,”left_low”,..: 7 4 2 3 3 6 4 4 4 5
$ irradiat : Factor w/ 3 levels “no”,”String”,..: 2 1 1 1 3 1 3 1 1 1 ...
$ Class : Factor w/ 3 levels “no-recurrence-events”,..: 3 2 1 2 1 2 1 1 1 1

> local_copy <- "cancer.csv"
> download.file(cancer_url, local_copy)
trying URL 'http://repository.seasr.org/Datasets/UCI/csv/breast-cancer.csv'
Content type ‘application/octet-stream’ length 18804 bytes (18 KB)
downloaded 18 KB

> cancer_data <- read.csv(local_copy)

3.3. Accessing Databases


R can connect to many database management systems (DBMS), such as SQLite,
MySQL, MariaDB, PostgreSQL and Oracle, using the DBI package together with a
backend package for the particular database. For SQLite we need to install and
load the DBI package and the backend package RSQLite. Define a database driver
of type SQLite using the function dbDriver() and set up a connection to the database
using the function dbConnect(). To retrieve data from the database, write a
query as a string containing SQL commands and send it to the database with the
function dbGetQuery().
> install.packages(“DBI”)
> install.packages(“RSQLite”)
> library(DBI)
> library(RSQLite)
> driver <- dbDriver(“SQLite”)
> db_file <- system.file("extdata", "crabtag.sqlite", package = "learningr")
> conn <- dbConnect(driver, db_file)
> query <- “SELECT * FROM IdBlock”
> id_block <- dbGetQuery(conn, query)
> id_block
Tag ID Firmware Version No Firmware Build Level
1 A03401 2 70

Alternatively, the function dbReadTable() reads a table from the connected
database and the function dbListTables() can list all the tables in the database.
> dbReadTable(conn, “idblock”)
Tag.ID Firmware.Version.No Firmware.Build.Level
1 A03401 2 70
> dbListTables(conn)
[1] “Daylog” “DeploymentNotebook” “IdBlock”
[4] “LifetimeNotebook” “TagNotebook”

The function dbDisconnect() is used to close the connection to the database
and the function dbUnloadDriver() is used to unload the defined database driver.
> dbDisconnect(conn)
> dbUnloadDriver(driver)

For a MySQL database we need to load the RMySQL package and set the
driver type to be "MySQL". The PostgreSQL, Oracle and JDBC databases need
the RPostgreSQL, ROracle and RJDBC packages respectively. To connect to
SQL Server or Access databases, the RODBC package needs to be loaded. In this
package, the function odbcConnect() is used to connect to the database, the
function sqlQuery() is used to run a query and the function odbcClose() is used to
close and clean up the database connection. The methods for accessing NoSQL
(Not only SQL) databases, which are lightweight databases that are more scalable
than traditional SQL relational databases, are less mature. To access the MongoDB
database the packages RMongo and rmongodb are used. The database Cassandra
can be accessed using the package RCassandra.

3.4. Data Cleaning and Transforming

3.4.1. Manipulating Strings

In some datasets or data frames logical values are represented as “Y” and “N” instead
of TRUE and FALSE. In such cases it is possible to replace the string with correct
logical value as in the example below.
> a <- c(1,2,3)
> b <- c(“A”, “B”, “C”)
> d <- c(“Y”, “N”, “Y”)
> df1 <- data.frame(a, b, d)
> df1
a b d
1 1 A Y
2 2 B N
3 3 C Y
convt <- function(x)
{
y <- rep.int(NA, length(x))
y[x == “Y”] <- TRUE
y[x == “N”] <- FALSE
y
}
> df1$d <- convt(df1$d)
> df1
a b d
1 1 A TRUE
2 2 B FALSE
3 3 C TRUE
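The same conversion can also be written with a named lookup vector. This is an alternative sketch rather than the method used above; the value "?" is added here to stand for an unexpected entry.

```r
d <- c("Y", "N", "Y", "?")   # "?" stands for an unexpected value

# Named lookup vector: "Y" maps to TRUE, "N" to FALSE, anything else to NA
lookup <- c(Y = TRUE, N = FALSE)
flag <- unname(lookup[d])
print(flag)
```

If the column is a factor, it should be wrapped in as.character() first, since indexing by a factor uses its integer codes instead of its labels.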

The functions grep() and grepl() are used to find a pattern in a given text and the
functions sub() and gsub() are used to replace a pattern with another in a given text.
The function sub() replaces only the first match, while gsub() replaces all matches.
The above four functions belong to the base package, but the package stringr
provides many more such string manipulation functions. The function str_detect()
in the stringr package also detects the presence of a given pattern in the given text.
We can also wrap the pattern in the function fixed() to indicate that it should be
matched as a fixed string rather than as a regular expression.
> grep(“my”, “This is my pen”)
[1] 1
> grepl(“my”, “This is my pen”)
[1] TRUE
> sub(“my”, “your”,”This is my pen”)
[1] “This is your pen”
> gsub(“my”, “your”,”This is my pen”)
[1] “This is your pen”
> library(stringr)
> str_detect("This is my pen", "my")
[1] TRUE
> str_detect(“This is my pen”, fixed(“my”))
[1] TRUE
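The base functions behave differently once the text has several elements or several matches. A minimal sketch with a made-up vector txt:

```r
txt <- c("my pen is in my bag", "your book", "this is my cup")

# grep() returns the positions of the matching elements;
# grepl() returns one logical value per element
hits  <- grep("my", txt)
flags <- grepl("my", txt)
print(hits)
print(flags)

# sub() replaces only the first match in a string; gsub() replaces every match
first_only <- sub("my", "your", txt[1])
all_matches <- gsub("my", "your", txt[1])
print(first_only)
print(all_matches)
```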

In the function str_detect(), it is possible to specify the search pattern with a
pipe symbol "|" to denote that we need to find either of the two patterns specified.
That is, we may be looking for the presence of "," or "and" in the given text as
shown below.
> str_detect(“I like mangoes, oranges and pineapples”, “,|and”)
[1] TRUE

The function str_split() is used to split a given text based on the pattern specified
as below. This function returns a list with one element for each input string. The
function str_split_fixed() can be used to split the given text into a fixed number of
strings based on the specified pattern. This function returns a character matrix.
> str_split("I like mangoes, oranges and pineapples", ",|and")
[[1]]
[1] "I like mangoes" " oranges " " pineapples"
> str_split_fixed("I like mangoes, oranges and pineapples", ",|and", n = 3)
     [,1]             [,2]        [,3]
[1,] "I like mangoes" " oranges " " pineapples"

The function str_count() can be used to count the number of occurrences of a
given pattern in the given text.
> str_count(“I like mangoes, oranges and pineapples”, “a|o”)
[1] 6
> str_count(“I like mangoes, oranges and pineapples”, “s”)
[1] 3

The function str_replace() can be used to replace the specified pattern with
another pattern in the given text. This function will only replace the first occurrence
of the pattern. Hence, to replace all the occurrences of the pattern we use the
function str_replace_all(). In these functions, to denote multiple patterns to be
replaced, they can be placed within square brackets. This means it should replace
all that matches these characters specified within the square brackets.
> str_replace(“I like mangoes, oranges and pineapples”, “s”, “sss”)
[1] “I like mangoesss, oranges and pineapples”
> str_replace_all(“I like mangoes, oranges and pineapples”, “s”, “sss”)
[1] “I like mangoesss, orangesss and pineapplesss”
> str_replace_all(“I like mangoes, oranges and pineapples”, “[ao]”, “-”)
[1] “I like m-ng-es, -r-nges -nd pine-pples”

In the example below, the various ways of storing the gender values are
transformed into one way, ignoring the case differences. This is done using the
str_replace() function together with the fixed() function, whose argument
ignore_case = TRUE makes the match case-insensitive. The first call also rewrites
the "male" inside every form of "female", which the second call then corrects.
> gender <- c("MALE", "Male", "male", "FEMALE", "Female", "female")
> clean_gender <- str_replace(gender, fixed("male", ignore_case = TRUE), "Male")
> clean_gender <- str_replace(clean_gender, fixed("female", ignore_case = TRUE),
"Female")
> clean_gender
[1] “Male” “Male” “Male” “Female” “Female” “Female”
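
The same clean-up can also be sketched in base R by lower-casing every value and then capitalising its first letter. This is an alternative under the assumption that the values differ only in case; the helper name capitalise is hypothetical and introduced only for this illustration.

```r
# Normalise case in base R: lower-case everything, then upper-case the first letter
gender <- c("MALE", "Male", "male", "FEMALE", "Female", "female")

# capitalise() is a hypothetical helper defined only for this sketch
capitalise <- function(s) {
  s <- tolower(s)
  paste0(toupper(substring(s, 1, 1)), substring(s, 2))
}

clean_gender <- capitalise(gender)
print(clean_gender)
```

This avoids the two-step replacement because no value is matched as a substring of another.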

3.4.2. Manipulating Data Frames

To add a column to a data frame, we assign the new values to a new column name
using the $ operator, as below.
> name <- c("Jhon", "Peter", "Mark")
> start_date <- c("1980-10-10", "1999-12-12", "1990-04-05")
> end_date <- c("1989-03-08", "2004-09-20", "2000-09-25")
> service <- data.frame(name, start_date, end_date)
> service
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
2 Peter 1999-12-12 2004-09-20
3 Mark 1990-04-05 2000-09-25
> service$period <- as.Date(service$end_date) - as.Date(service$start_date)
> service
name start_date end_date period
1 Jhon 1980-10-10 1989-03-08 3071 days
2 Peter 1999-12-12 2004-09-20 1744 days
3 Mark 1990-04-05 2000-09-25 3826 days

The same can be achieved using the function with() as below.
> service$period <- with(service, as.Date(end_date) - as.Date(start_date))
> service
   name start_date   end_date    period
1  Jhon 1980-10-10 1989-03-08 3071 days
2 Peter 1999-12-12 2004-09-20 1744 days
3  Mark 1990-04-05 2000-09-25 3826 days

Another way of doing the same is using the function within(). The difference is
that within() can add several columns to a data frame in a single command, which
is not possible with the with() function, because with() returns only the value of
the expression it evaluates.
> service <- within(service,
{
period <- as.Date(end_date) - as.Date(start_date)
highperiod <- period > 2000
})
> service
name start_date end_date period highperiod
1 Jhon 1980-10-10 1989-03-08 3071 days TRUE
2 Peter 1999-12-12 2004-09-20 1744 days FALSE
3 Mark 1990-04-05 2000-09-25 3826 days TRUE

The mutate() function in the plyr package does the same job as the function
within(), but the syntax is slightly different: each new column is passed as a named
argument, and a later column can refer to one created earlier in the same call.
> library(plyr)
> service <- mutate(service,
+ period = as.Date(end_date) - as.Date(start_date),
+ highperiod = period > 2000)
> service
   name start_date   end_date    period highperiod
1  Jhon 1980-10-10 1989-03-08 3071 days       TRUE
2 Peter 1999-12-12 2004-09-20 1744 days      FALSE
3  Mark 1990-04-05 2000-09-25 3826 days       TRUE

The function complete.cases() returns a logical vector that indicates which rows
of a data frame are free of missing values. The function na.omit() removes the rows
with missing values from a data frame, and the function na.fail() throws an error
if the data frame contains any missing values.
> crime.data <- read.csv("F:/Crimes.csv")
> nrow(crime.data)
[1] 65535
> complete <- complete.cases(crime.data)
> nrow(crime.data[complete, ])
[1] 63799
> clean.crime.data <- na.omit(crime.data)
> nrow(clean.crime.data)
[1] 63799
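
Since the crime data file may not be at hand, the three functions can be tried on a small data frame. The sketch below uses made-up values; the data frame df and its columns are assumptions for illustration only.

```r
# A tiny data frame with missing values in two different rows (made-up data)
df <- data.frame(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  grade = c("A", "B", NA, "D"),
  stringsAsFactors = FALSE
)

# complete.cases() gives one TRUE/FALSE per row
complete <- complete.cases(df)
n_complete <- sum(complete)

# na.omit() drops every row that contains an NA
clean <- na.omit(df)

# na.fail() stops with an error when any NA is present; catch it to inspect
failed <- tryCatch({ na.fail(df); FALSE }, error = function(e) TRUE)

print(complete)
print(clean)
print(failed)
```

Summing the logical vector returned by complete.cases() gives the number of complete rows directly.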

A data frame can be transformed by keeping only a few of its columns while
retaining all the rows, as in the example below.
> crime.data <- read.csv("F:/Crimes.csv")
> colnames(crime.data)
 [1] "CASE."                 "DATE..OF.OCCURRENCE"   "BLOCK"
 [4] "IUCR"                  "PRIMARY.DESCRIPTION"   "SECONDARY.DESCRIPTION"
 [7] "LOCATION.DESCRIPTION"  "ARREST"                "DOMESTIC"
[10] "BEAT"                  "WARD"                  "FBI.CD"
[13] "X.COORDINATE"          "Y.COORDINATE"          "LATITUDE"
[16] "LONGITUDE"             "LOCATION"
> crime.data1 <- crime.data[, 1:6]
> colnames(crime.data1)
[1] "CASE."                 "DATE..OF.OCCURRENCE"   "BLOCK"
[4] "IUCR"                  "PRIMARY.DESCRIPTION"   "SECONDARY.DESCRIPTION"

Alternatively, the data frame can be transformed by selecting only the required
rows and retaining all columns of a data frame as in the example below.
> nrow(crime.data)
[1] 65535
> crime.data2 <- crime.data[1:10,]
> nrow(crime.data2)
[1] 10
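
Columns and rows can also be selected by name and by condition instead of by position. The sketch below reuses the small service data frame from earlier in this section; the cut-off date is an arbitrary value chosen for illustration.

```r
# Rebuild the small service data frame used earlier in this section
service <- data.frame(
  name       = c("Jhon", "Peter", "Mark"),
  start_date = c("1980-10-10", "1999-12-12", "1990-04-05"),
  end_date   = c("1989-03-08", "2004-09-20", "2000-09-25"),
  stringsAsFactors = FALSE
)

# Select columns by name, keeping all rows
name_and_end <- service[, c("name", "end_date")]

# Select rows by a logical condition, keeping all columns
started_late <- service[as.Date(service$start_date) > as.Date("1985-01-01"), ]

print(name_and_end)
print(started_late)
```

Selecting by name keeps the code valid even if the column order of the data frame changes.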

The function sort() sorts the given vector of numbers or strings. By default it
sorts from smallest to largest, but this can be altered using the argument
decreasing = TRUE.
> x <- c(5, 10, 3, 15, 6, 8)
> sort(x)
[1]  3  5  6  8 10 15
> sort(x, decreasing = TRUE)
[1] 15 10  8  6  5  3
> y <- c("X", "AB", "Deer", "For", "Moon")
> sort(y)
[1] "AB"   "Deer" "For"  "Moon" "X"
> sort(y, decreasing = TRUE)
[1] "X"    "Moon" "For"  "Deer" "AB"

The function order() is closely related to the sort() function. Instead of the
sorted values, it returns the indices of the vector elements in the order in which
they would appear after sorting, as below. Hence x[order(x)] is the same as sort(x).
This can be confirmed with the identical() function.
> order(x)
[1] 3 1 5 6 2 4
> x[order(x)]
[1] 3 5 6 8 10 15
> identical(sort(x), x[order(x)])
[1] TRUE

The order() function is more useful than the sort() function because the indices
it returns can be used to reorder the rows of a data frame.
> name <- c("Jhon", "Peter", "Mark")
> start_date <- c("1980-10-10", "1999-12-12", "1990-04-05")
> end_date <- c("1989-03-08", "2004-09-20", "2000-09-25")
> service <- data.frame(name, start_date, end_date)
> service
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
2 Peter 1999-12-12 2004-09-20
3 Mark 1990-04-05 2000-09-25
> startdt <- order(service$start_date)
> service.ordered <- service[startdt, ]
> service.ordered
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
3 Mark 1990-04-05 2000-09-25
2 Peter 1999-12-12 2004-09-20
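
The order() function also accepts several keys, which makes it possible to sort a data frame by one column and break ties with another. The following sketch uses a made-up staff data frame; all names and values in it are hypothetical.

```r
# Made-up data: two departments, different salaries
staff <- data.frame(
  name   = c("Asha", "Ben", "Chen", "Dina"),
  dept   = c("B", "A", "B", "A"),
  salary = c(300, 200, 500, 400),
  stringsAsFactors = FALSE
)

# Sort by department (ascending), then by salary (descending, via the minus sign)
idx <- order(staff$dept, -staff$salary)
staff.ordered <- staff[idx, ]

print(staff.ordered)
```

Negating a numeric key is a common way to sort that key in decreasing order while the other keys stay ascending.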

The arrange() function of the plyr package does the same job as the code above.
> library(plyr)
> arrange(service, start_date)
   name start_date   end_date
1  Jhon 1980-10-10 1989-03-08
2  Mark 1990-04-05 2000-09-25
3 Peter 1999-12-12 2004-09-20

The rank() function returns the rank of each element of a vector. By default,
elements with the same value share the average of the ranks they occupy. By
specifying the argument ties.method = "first", each tied element gets its own rank,
in order of appearance.
> x <- c(9, 5, 4, 6, 4, 5)
> rank(x)
[1] 6.0 3.5 1.5 5.0 1.5 3.5
> rank(x, ties.method = "first")
[1] 6 3 1 5 2 4
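
Other values of ties.method are available as well. The short sketch below shows "min" and "max" on the same vector; it is only meant to illustrate how ties are resolved.

```r
x <- c(9, 5, 4, 6, 4, 5)

# "min": tied elements all receive the smallest rank of their group
rank_min <- rank(x, ties.method = "min")

# "max": tied elements all receive the largest rank of their group
rank_max <- rank(x, ties.method = "max")

print(rank_min)
print(rank_max)
```

With "min" the two values of 4 both get rank 1, while with "max" they both get rank 2.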

SQL statements can be executed from R on data frames, and the results are
returned as data frames, much as a database would return a result set. The package
sqldf needs to be installed to manipulate data frames or datasets using SQL.
> install.packages("sqldf")
> library(sqldf)
> query <- "SELECT * FROM iris WHERE Species = 'setosa'"
> sqldf(query)
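
For readers without sqldf installed, the same selection can be sketched in base R with subset(). This is an equivalent for this particular query only, not a general replacement for SQL.

```r
# Base R equivalent of: SELECT * FROM iris WHERE Species = 'setosa'
setosa <- subset(iris, Species == "setosa")

print(nrow(setosa))
print(head(setosa, 3))
```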

3.4.3. Data Reshaping

Data reshaping in R is about changing the way data is organized into rows and
columns. Most of the time, data processing in R is done by taking the input data as a
data frame. It is easy to extract data from the rows and columns of a data frame, but
there are situations when we need the data frame in a format different from the one we
received. R has a few functions to split, merge and change the columns to rows and
vice versa in a data frame.

The cbind() function joins multiple vectors column-wise. The result is a matrix
(here a character matrix, because the vectors are of mixed types); data.frame() can
be used instead when a data frame is required. Rows can be appended using the rbind()
function, as when the two sets of addresses are combined below. Note that the zip code
06161 is entered as a number, so its leading zero is lost.
> city <- c("Tampa", "Seattle", "Hartford", "Denver")
> state <- c("FL", "WA", "CT", "CO")
> zipcode <- c(33602, 98104, 06161, 80294)
> addresses <- cbind(city, state, zipcode)
> addresses
     city       state zipcode
[1,] "Tampa"    "FL"  "33602"
[2,] "Seattle"  "WA"  "98104"
[3,] "Hartford" "CT"  "6161"
[4,] "Denver"   "CO"  "80294"
> new.address <- data.frame(
+ city = c("Lowry", "Charlotte"),
+ state = c("CO", "FL"),
+ zipcode = c("80230", "33949"),
+ stringsAsFactors = FALSE
+ )
> print(new.address)
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949
> all.addresses <- rbind(addresses, new.address)
> all.addresses
       city state zipcode
1     Tampa    FL   33602
2   Seattle    WA   98104
3  Hartford    CT    6161
4    Denver    CO   80294
5     Lowry    CO   80230
6 Charlotte    FL   33949
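
If the leading zero of a zip code has to be kept, one option is to store the zip codes as strings and to build a data frame directly. The sketch below is a variation on the example above under that assumption, not the method used in it.

```r
# Store zip codes as character strings so that "06161" keeps its leading zero
city    <- c("Tampa", "Seattle", "Hartford", "Denver")
state   <- c("FL", "WA", "CT", "CO")
zipcode <- c("33602", "98104", "06161", "80294")

# data.frame() keeps each column's own type, unlike cbind() on plain vectors
addresses <- data.frame(city, state, zipcode, stringsAsFactors = FALSE)

new.address <- data.frame(
  city    = c("Lowry", "Charlotte"),
  state   = c("CO", "FL"),
  zipcode = c("80230", "33949"),
  stringsAsFactors = FALSE
)

# rbind() appends the rows because both data frames have the same columns
all.addresses <- rbind(addresses, new.address)
print(all.addresses)
```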

The merge() function can be used to merge two data frames on one or more key
columns. By default the key columns must have the same names in both data frames;
otherwise they are named using the by.x and by.y arguments. In the example below, we
consider the data sets about diabetes in Pima Indian women available in the library
named "MASS". The two datasets are merged based on the values of blood pressure
("bp") and body mass index ("bmi"). The records in which the values of these two
variables match in both data sets are combined to form a single data frame.
> library(MASS)
> head(Pima.te)
npreg glu bp skin bmi ped age type
1 6 148 72 35 33.6 0.627 50 Yes
2 1 85 66 29 26.6 0.351 31 No
3 1 89 66 23 28.1 0.167 21 No
...
> head(Pima.tr)
npreg glu bp skin bmi ped age type
1 5 86 68 28 30.2 0.364 24 No
2 7 195 70 33 25.1 0.163 55 Yes
3 5 77 82 41 35.8 0.156 35 No
...
> nrow(Pima.te)
[1] 332
> nrow(Pima.tr)
[1] 200
> merged.Pima <- merge(x = Pima.te, y = Pima.tr,
+ by.x = c("bp", "bmi"),
+ by.y = c("bp", "bmi")
+ )
> head(merged.Pima)
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20
2 64 29.7 2 75 24 0.370 33 No 2 100 23
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13
...
ped.y age.y type.y
1 0.088 31 No
2 0.368 21 No
3 0.295 24 No
...
> nrow(merged.Pima)
[1] 17
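
The behaviour of merge() is easier to see on very small inputs. The sketch below uses two made-up data frames (left and right are hypothetical names) and contrasts the default merge, which keeps only matching keys, with all = TRUE, which keeps every key.

```r
# Two small made-up data frames sharing the key column "id"
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"), stringsAsFactors = FALSE)
right <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

# Default: only ids present in both data frames are kept
inner <- merge(left, right, by = "id")

# all = TRUE: every id is kept and the gaps are filled with NA
outer <- merge(left, right, by = "id", all = TRUE)

print(inner)
print(outer)
```

The Pima example above corresponds to the default case, which is why only the matching records survive the merge.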

One of the most interesting aspects of R programming is changing the shape of
the data in multiple steps to get a desired shape. In the reshape2 package, the
functions used to do this are melt() and dcast() (the older reshape package calls
the latter cast()). We consider the dataset called ships present in the library
called "MASS".
> library(MASS)
> head(ships)
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
...

Now we melt the data using the melt() function in the package reshape2,
converting all columns other than type and year into rows of variable-value pairs.
> library(reshape2)
> molten.ships <- melt(ships, id = c("type", "year"))
> head(molten.ships)
type year variable value
1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75
5 A 70 period 60
6 A 70 period 75
> nrow(molten.ships)
[1] 120
> nrow(ships)
[1] 40

We can cast the molten data into a new form in which the values for each type
of ship in each year are aggregated by summing. With reshape2 this is done using
the dcast() function.
> recasted.ship <- dcast(molten.ships, type + year ~ variable, sum)
> head(recasted.ship)
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
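
The same kind of grouped total can be sketched in base R with aggregate(), without melting first. The example below uses a small hand-typed data frame shaped like ships (named toy here, an assumption for illustration) so that it runs without any package.

```r
# A small hand-typed data frame with the same layout as a few rows of ships
toy <- data.frame(
  type      = c("A", "A", "A", "A", "B", "B"),
  year      = c(60, 60, 65, 65, 60, 60),
  service   = c(127, 63, 1095, 1095, 44882, 17176),
  incidents = c(0, 0, 3, 4, 39, 29)
)

# Sum service and incidents for every combination of type and year
totals <- aggregate(cbind(service, incidents) ~ type + year, data = toy, FUN = sum)
print(totals)
```

This gives one row per type and year, which is the same shape as the cast result above for the columns included.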

3.4.4. Grouping Functions

R has a family of apply functions, namely apply(), lapply(), sapply(), vapply(),
mapply(), rapply() and tapply(), together with the related grouping functions
aggregate() and by(). Function apply() applies a function to the rows or columns of
a matrix or, more generally, to the dimensions of an array. Function lapply() is a
list apply, which acts on a list or vector and returns a list. Function sapply() is
a simplified lapply() that returns a vector or matrix when possible. Function
vapply() is a verified sapply() that requires the type of the return value to be
specified in advance. Function mapply() is a multivariate apply that steps through
several vectors or lists in parallel. Function rapply() is a recursive apply for
nested lists, i.e. lists within lists. Function tapply() is a tagged apply where the
tags identify the subsets.

If we want to apply a function to the rows or columns of a matrix or array, we
use the apply() function as below. The second argument selects the dimension:
1 for rows, 2 for columns.
> M <- matrix(seq(1, 16), 4, 4)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> apply(M, 1, min)
[1] 1 2 3 4
> apply(M, 2, max)
[1]  4  8 12 16
> M <- array(seq(32), dim = c(4, 4, 2))
> M
, , 1
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
, , 2
     [,1] [,2] [,3] [,4]
[1,]   17   21   25   29
[2,]   18   22   26   30
[3,]   19   23   27   31
[4,]   20   24   28   32
> apply(M, 1, sum)
[1] 120 128 136 144
> apply(M, c(1, 2), sum)
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48
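
The function passed to apply() does not have to be a built-in one. The sketch below passes an anonymous function and also compares apply() with the dedicated helper colMeans(); the variable names are chosen only for this illustration.

```r
M <- matrix(1:16, 4, 4)

# Anonymous function: spread (max minus min) of each row
row_spread <- apply(M, 1, function(r) max(r) - min(r))

# Column means via apply() and via the dedicated helper colMeans()
col_means_apply  <- apply(M, 2, mean)
col_means_helper <- colMeans(M)

print(row_spread)
print(col_means_apply)
print(all.equal(col_means_apply, col_means_helper))
```

For simple sums and means, rowSums(), colSums(), rowMeans() and colMeans() do the same job as the corresponding apply() calls.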

If we want to apply a function to each element of a list in turn and get a list
back, we use the lapply() function as below.
> x <- list(a = 1, b = 1:3, c = 10:100)
> x
$a
[1] 1
$b
[1] 1 2 3
$c
 [1]  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
[18]  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43
[35]  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
[52]  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77
[69]  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94
[86]  95  96  97  98  99 100
> lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
> lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005

We use the function sapply() if we want to apply a function to each element of
a list in turn and we want a vector back.
> x <- list(a = 1, b = 1:3, c = 10:100)
> x
$a
[1] 1
$b
[1] 1 2 3
$c
[1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[18] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[35] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
[52] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
[69] 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[86] 95 96 97 98 99 100

> sapply(x, FUN = length)
 a  b  c
 1  3 91
> sapply(x, FUN = sum)
   a    b    c
   1    6 5005

When we want to use the function sapply(), but need to squeeze some more speed
out of the code, we use the function vapply() as below. For the function vapply(), we
give R the information on what the function will return, which can save some time
coercing returned values to fit in a single atomic vector. In the example below, we tell
R that everything returned by length() should be an integer of length 1.
> x <- list(a = 1, b = 1:3, c = 10:100)
> vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
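
Besides speed, vapply() also acts as a check on the result. In the sketch below, range() returns two numbers per element, so FUN.VALUE = numeric(2) succeeds and gives a matrix, while declaring a single number makes vapply() stop with an error. The tryCatch() wrapper is used here only to show the failure without halting the script.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Each call to range() returns two numbers, so declare numeric(2)
ranges <- vapply(x, FUN = range, FUN.VALUE = numeric(2))
print(ranges)

# Declaring the wrong length is reported as an error instead of passing silently
bad <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
print(bad)
```

With sapply() the same mismatch would go unnoticed, because sapply() simply returns whatever shape the results happen to have.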

When we have several data structures (e.g. vectors, lists) and we want to apply
a function to the 1st elements of each, and then to the 2nd elements of each, etc.,
coercing the result to a vector/array where possible, we use the function mapply()
as below.
> mapply(sum, 1:5, 1:5, 1:5)
[1]  3  6  9 12 15
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
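
The arguments of mapply() can also be character vectors, and arguments that stay the same in every call can be supplied through MoreArgs. The sketch below uses made-up names as data; it is an illustration of these two options only.

```r
# Pair up the elements of two character vectors
full_names <- mapply(
  function(first, last) paste(first, last),
  c("Ada", "Alan"), c("Lovelace", "Turing"),
  USE.NAMES = FALSE
)
print(full_names)

# MoreArgs holds arguments that stay fixed in every call:
# here x = 7 is repeated 1, 2 and 3 times
repeated <- mapply(rep, times = 1:3, MoreArgs = list(x = 7))
print(repeated)
```

When the results have different lengths, as in the second call, mapply() returns a list instead of a vector.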

When we want to apply a function to each element of a nested list structure,
recursively, we use the function rapply() as below. The function rapply() is best
illustrated with a user-defined function to be applied.
> myFun <- function(x){
+ if (is.character(x)){
+ return(paste(x, "!", sep = ""))
+ }
+ else{
+ return(x + 1)
+ }
+ }
> l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
+ b = 3, c = "Yikes",
+ d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
> rapply(l, myFun)
    a.a1     a.b1     a.c1        b        c     d.a2  d.b2.a3  d.b2.b3
  "Boo!"      "3"  "Eeek!"      "4" "Yikes!"      "2"   "Hey!"      "6"
> rapply(l, myFun, how = "replace")
$a
$a$a1
[1] "Boo!"
$a$b1
[1] 3
$a$c1
[1] "Eeek!"
$b
[1] 4
$c
[1] "Yikes!"
$d
$d$a2
[1] 2
$d$b2
$d$b2$a3
[1] "Hey!"
$d$b2$b3
[1] 6
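
The classes argument of rapply() restricts the function to elements of a given class, which removes the need for the type test inside the user-defined function. The sketch below uses a shortened version of the list above and is only an illustration of that argument.

```r
# A shortened version of the nested list used above
l <- list(a = list(a1 = "Boo", b1 = 2), b = 3, c = "Yikes")

# Add 1 to numeric leaves only; with how = "replace" other leaves are kept as they are
res <- rapply(l, function(x) x + 1, classes = "numeric", how = "replace")
str(res)
```

The character elements pass through unchanged, and the nesting of the list is preserved.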

When we want to apply a function to subsets of a vector and the subsets are
defined by some other vector, usually a factor, we use the function tapply() as below.
> x <- 1:20
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> y <- factor(rep(letters[1:5], each = 4))
> y
 [1] a a a a b b b b c c c c d d d d e e e e
Levels: a b c d e
> tapply(x, y, sum)
 a  b  c  d  e
10 26 42 58 74
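
The grouping vector does not need to be sorted or arranged in blocks. The sketch below uses made-up sales figures with an interleaved region factor; the names sales and region are assumptions for illustration.

```r
# Made-up sales values with an interleaved grouping factor
sales  <- c(10, 20, 30, 40, 50, 60)
region <- factor(c("N", "S", "N", "S", "N", "S"))

# Mean of sales within each region
mean_by_region <- tapply(sales, region, mean)
print(mean_by_region)
```

The result is a named array with one entry per level of the factor.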

The by() function can be thought of as a "wrapper" for the function tapply().
It becomes useful when we want to compute a task that tapply() cannot handle, such
as applying a function to a whole data frame group by group.
> cta <- tapply(iris$Sepal.Width, iris$Species, summary)
> cba <- by(iris$Sepal.Width, iris$Species, summary)
> cta
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
> cba
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800

If we print these two objects, cta and cba, we get the same results. The only
differences are in how they are displayed, because the objects have different class
attributes. The power of the function by() shows when we cannot use the function
tapply(), as in the following code.
> tapply(iris, iris$Species, summary)
Error in tapply(iris, iris$Species, summary) :
arguments must have same length

R reports that the arguments must have the same length. We want to calculate the
summary of every variable in iris for each level of the factor Species, but tapply()
cannot do that here, because its first argument is a data frame rather than a vector
of the same length as the factor. The by() function splits the data frame by rows
and lets the summary() function work on each piece.
> bywork <- by(iris, iris$Species, summary )
> bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
Median :5.000 Median :3.400 Median :1.500 Median :0.200
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
Species
setosa :50
versicolor: 0
virginica : 0
------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
Median :5.900 Median :2.800 Median :4.35 Median :1.300
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
Species
setosa :0
versicolor:50
virginica : 0
------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
Median :6.500 Median :3.000 Median :5.550 Median :2.000
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Species
setosa :0
versicolor: 0
virginica :50

The result is an object of class by that holds, for each level of Species, the
summary of every variable in the data frame.

The aggregate() function can be seen as another way of using the tapply()
function when it is used as below.
> att <- tapply(iris$Sepal.Length , iris$Species , mean)
> agt <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
> att
setosa versicolor virginica
5.006 5.936 6.588
> agt
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588

The two immediate differences are that the second argument of the aggregate()
function must be a list, whereas for tapply() it can be (but need not be) a list,
and that the output of aggregate() is a data frame whereas the output of tapply()
is an array. The power of the aggregate() function is that it can easily handle
subsets of the data through its subset argument and that it accepts a formula as
well. These features make the aggregate() function easier to work with than the
tapply() function in some situations.
> ag <- aggregate(len ~ ., data = ToothGrowth, mean)
> ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4   VC  1.0 16.77
5   OJ  2.0 26.06
6   VC  2.0 26.14
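
The formula and subset arguments can be combined. The sketch below groups ToothGrowth by supp only and then repeats the calculation for the doses above 0.5; the threshold is an arbitrary choice for illustration.

```r
# Mean tooth length per supplement type, using the formula interface
ag_supp <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)

# The same calculation restricted to the larger doses via the subset argument
ag_high <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean,
                     subset = dose > 0.5)

print(ag_supp)
print(ag_high)

# Cross-check the first result against tapply()
check <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
print(all.equal(ag_supp$len, as.vector(check)))
```

The cross-check at the end shows that aggregate() and tapply() compute the same group means and differ only in the shape of the result.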

HIGHLIGHTS
• One of the packages in R is datasets, which is filled with example datasets.
• We can see all the datasets available in the loaded packages using the data() function.
• The read.table() function reads delimited text files such as CSV files and stores the result in a data frame.
• The system.file() function is used to locate files that are inside a package.
• Writing data from R into a file is done using the functions write.table() and write.csv().
• If the file is unstructured, it is read using the function readLines().
• The writeLines() function takes a text line and the file name as arguments and writes the text to the file.
• An XML file can be imported using the function xmlParse().
• The function used to import a JSON file is fromJSON() and the function used to export a JSON file is toJSON().
• Spreadsheets can be imported with the functions read.xlsx() and read.xlsx2().
• To write to an Excel file from R we use the function write.xlsx2().
• The read.ssd() function is used to read SAS datasets.
• The read.spss() function is used to import SPSS data files.
• MATLAB binary data files can be read and written using the readMat() and writeMat() functions in the R.matlab package.
• R can connect to DBMS such as SQLite, MySQL, MariaDB, PostgreSQL and Oracle using the DBI package.
• The function dbReadTable() reads a table from the connected database.
• The functions grep() and grepl() are used to find a pattern in a given text.
• The functions sub() and gsub() are used to replace a pattern with another in a given text.
• The function str_split() is used to split a given text based on the pattern specified.
• The function str_count() counts the number of occurrences of a given pattern in the given text.
• The function str_replace() replaces the first occurrence of the specified pattern with another string in the given text.
• The function na.omit() removes the rows with missing values from a data frame.
• The function sort() sorts the given vector of numbers or strings.
• The function order() returns the indices that would put a vector into sorted order.
• The rank() function returns the rank of each element of a vector.
• The package sqldf needs to be installed to manipulate data frames or datasets using SQL.
• Changing the shape of a data frame is done using the functions melt() and dcast() of the reshape2 package.
• R has many grouping functions such as apply(), lapply(), sapply(), vapply(), mapply(), rapply(), tapply(), aggregate() and by().
CHAPTER 4

Graphics using R

OBJECTIVES

On completion of this Chapter you will be able to:
• understand what Exploratory Data Analysis is
• identify the main graphical packages
• draw pie charts using R
• draw scatter plots using R
• draw line plots using R
• draw histograms using R
• draw box plots using R
• draw bar plots using R
• know about other existing graphical packages

4.1. Exploratory Data Analysis


Exploratory Data Analysis (EDA) is a visually based method used to analyse data
sets and to summarize their main characteristics. It uses visualisation and
transformation to explore data in a systematic way. EDA is an iterative cycle of
the steps below:
1) Generate questions about data.
2) Search for answers by visualising, transforming, and modelling data.
3) Use what is learnt to refine questions and/or generate new questions.

EDA employs a variety of techniques (mostly graphical) to:
1) Maximize insight into a data set
2) Uncover underlying structure
3) Extract important variables
4) Detect outliers and anomalies
5) Test underlying assumptions
6) Develop parsimonious models
7) Determine optimal factor settings.

4.2. Main Graphical Packages


The basic graphs in R can be drawn using the base graphics system. It has some
limitations, which are addressed by the next level of graphics, the grid graphics
system. This low-level system allows points or lines to be placed exactly where
desired, but it does not provide ready-made plots such as a scatter plot. Hence, we
go to the next level of plotting, which is the lattice graphics system. In this
system, the result of a plot can be saved in a variable. Its plots can also contain
multiple panels, in which we can draw multiple graphs and compare them with each
other. The next level is the ggplot2 graphics system, in which "gg" stands for
"grammar of graphics". It breaks a graph down into many parts or chunks that are
combined to build the plot.

4.3. Pie Charts


In R the pie chart is created using the pie() function, which takes a vector of
non-negative numbers as input. The additional parameters are used to control labels,
colour, title etc. The basic syntax for creating a pie chart is given below, followed
by an explanation of the parameters.
pie(x, labels, radius, main, col, clockwise)

x – numeric vector
labels – description of the slices
radius – radius of the pie, a value between -1 and +1
main – title of the chart
col – colour palette
clockwise – logical value – TRUE (Clockwise), FALSE (Anti Clockwise)

> x <- c(25, 35, 10, 5, 15)
> labels <- c("Rose", "Lotus", "Lilly", "Sunflower", "Jasmine")
> percent <- paste0(round(100 * x / sum(x), 1), "%")
> pie(x, labels = percent, main = "Flowers", col = rainbow(length(x)))
> legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))

Figure 4.1 Pie Chart of Flowers

A 3D pie chart can be drawn using the function pie3D() from the package plotrix.
> install.packages("plotrix")
> library(plotrix)
> pie3D(x, labels = labels, explode = 0.1, main = "Flowers")

Figure 4.2 3-D Pie Chart of Flowers

4.4. Scatter Plots


Scatter plots are used for exploring the relationship between two continuous
variables. Let us consider the dataset "cars", which lists the "Speed and Stopping
Distances of Cars". The basic scatter plot in the base graphics system is obtained
using the plot() function as in Fig. 4.3. The example below uses the plot to examine
whether the speed of a car has an effect on its stopping distance.
> colnames(cars)
[1] "speed" "dist"
> plot(cars$speed, cars$dist)

Figure 4.3 Basic Scatter Plot of Car Speed Vs Distance

This plot can be made more appealing and readable by adding colour and
changing the plotting character. For this we use the arguments col and pch (which
can take integer values between 0 and 25) in the plot() function as below. The plot
in Fig. 4.4 shows that there is a strong positive correlation between the speed of
a car and its stopping distance.
> plot(cars$speed, cars$dist, col = "red", pch = 15)
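
The visual impression of a strong positive correlation can be backed with a number. The sketch below computes the Pearson correlation coefficient and fits a straight line; it is an added illustration, and the variable names r and fit are chosen only for it.

```r
# Pearson correlation between speed and stopping distance (about 0.81)
r <- cor(cars$speed, cars$dist)
print(r)

# A straight-line fit; a positive slope agrees with the upward trend in the plot
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))
```

After the scatter plot has been drawn, calling abline(fit) adds the fitted line to it.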

Figure 4.4 Coloured Scatter Plot of Car Speed Vs Distance


The layout() function is used to arrange multiple plots in a grid described by
a matrix. In the example below, four related plots are placed in a single figure as
in Fig. 4.5.
> data(mtcars)
> layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE))
> plot(mtcars$wt, mtcars$mpg, col = "blue", pch = 17)
> plot(mtcars$wt, mtcars$disp, col = "red", pch = 15)
> plot(mtcars$mpg, mtcars$disp, col = "dark green", pch = 10)
> plot(mtcars$mpg, mtcars$hp, col = "violet", pch = 7)

Figure 4.5 Layout of Multiple Scatter Plots

When we have more than two variables and we want to examine the relationship
between each variable and the remaining ones, we use a scatter plot matrix. We use
the pairs() function to create matrices of scatter plots as in Fig. 4.6. The basic
syntax for creating scatter plot matrices in R is as below.
pairs(formula, data)

> pairs(~ wt + mpg + disp + cyl, data = mtcars, main = "Scatterplot Matrix")

Figure 4.6 Scatter Plot Matrix Using pairs()

The lattice graphics system has an equivalent of the plot() function, namely
xyplot(). This function uses a formula to specify the x and y variables (yvar ~ xvar)
and a data frame argument in which these variables are looked up. To use this
function, the lattice package must be loaded.
> library(lattice)
> xyplot(mpg ~ disp, data = mtcars, col = "purple", pch = 7)

Figure 4.7 Scatter Plot Matrix Using xyplot()


Axis scales can be specified in xyplot() using the scales argument, which must
be a list of name = value pairs. If we specify log = TRUE, log scales are set for
both the x and y axes as in Fig. 4.8. The scales list can also contain the
components x and y, which set the x and y axes individually.
> xyplot(mpg ~ disp, data = mtcars, scales = list(log = TRUE),
col = "red", pch = 11)

Figure 4.8 Scatter Plot Matrix with Axis Scales Using xyplot()

The data in the graph can be split based on one of the columns in the dataset,
for example carb. This is done by appending the pipe symbol (|) and the name of the
column used for splitting to the formula. The argument relation = "same" means that
all panels share the same axes. If the argument alternating = TRUE, the axis tick
labels are drawn on alternating sides of the plot from panel to panel; with
alternating = FALSE, as in Fig. 4.9, they stay on the same side.
> xyplot(mpg ~ disp | carb, data = mtcars,
scales = list(log = TRUE, relation = "same", alternating = FALSE),
layout = c(3, 2), col = "blue", pch = 14)

Figure 4.9 Scatter Plot Split on a Column

Lattice plots can be stored in variables and hence can be further modified
using the function update() as below.
> graph1 <- xyplot(mpg ~ disp | carb, data = mtcars,
scales = list(log = TRUE, relation = "same", alternating = FALSE),
layout = c(3, 2), col = "blue", pch = 14)
> graph2 <- update(graph1, col = "yellow", pch = 6)

In the ggplot2 graphics, each plot is drawn with a call to the ggplot() function
as in Fig. 4.10. This function takes a data frame as its first argument. The data
frame columns are mapped to the x and y axes using the aes() function, which is
used within the ggplot() function. The layers of the graph are then added using the
geom functions, such as geom_point(), appended with a "+" symbol to the ggplot() call.
> library(ggplot2)
> ggplot(mtcars, aes(mpg, disp)) +
geom_point(color = "purple", shape = 16, cex = 2.5)

Figure 4.10 Scatter Plot Using ggplot()

The ggplots can also be split into several panels like the lattice plots, as in
Fig. 4.11. This is done using the function facet_wrap(), which takes a formula
naming the column used for splitting. The function theme() is used here to set the
orientation of the x-axis tick labels. The functions facet_wrap() and theme() are
appended to the ggplot() call using the "+" symbol. The ggplots can be stored in a
variable like the lattice plots, and wrapping the assignment in parentheses makes
the plot print automatically.
> (graph1 <- ggplot(mtcars, aes(mpg, disp)) +
geom_point(color = "dark green", shape = 15, cex = 3))
> (graph2 <- graph1 + facet_wrap(~ cyl, ncol = 3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)))

Figure 4.11 – Scatter Plot Split into Panels Using ggplot()



4.5. Line Plots


A line chart / line plot is a graph that connects a series of points by drawing
line segments between them. Line charts are usually used for identifying trends in
data. The plot() function in R is used to create the line graph in base graphics as
in Fig. 4.12. This function takes a vector of numbers as input together with a few
more parameters listed below.
plot(v, type, col, xlab, ylab, main)

v – numeric vector
type – takes the value "p" (only points), "l" (only lines) or "o" (both points and lines)
xlab – label of x-axis
ylab – label of y-axis
main – title of the chart
col – colour palette

Figure 4.12 Line Plot Using Basic Graphics


> male <- c(1000, 2000, 1500, 4000, 800)
> female <- c(700, 300, 600, 1200, 800)
> child <- c(1000, 1200, 1500, 800, 2000)
> wages <- c("Male", "Female", "Children")
> color <- c("red", "blue", "green")
> plot(male, type = "o", col = "red", xlab = "Month", ylab = "Wages",
main = "Monthly Wages", ylim = c(0, 5000))
> lines(female, type = "o", col = "blue")
> lines(child, type = "o", col = "green")
> legend("topleft", wages, cex = 0.8, fill = color)

Line plots in the lattice graphics use the xyplot() function as in Fig. 4.13.
Multiple lines can be created using the "+" symbol on the left-hand side of the
formula, where the y variables are listed. The argument type = "l" specifies that
the data are drawn as a continuous line. The economics data set used here comes with
the ggplot2 package.
> xyplot(pop + unemploy ~ date, data = economics, type = "l")

Figure 4.13 Line Plot Using Lattice Graphics

In ggplot2 graphics, the same syntax as for scatter plots is used, except that
the geom_point() function is replaced with the geom_line() function, as in Fig. 4.14.
However, a separate geom_line() call is needed for each line to be drawn
in the graph.
> ggplot(economics, aes(economics$date)) + geom_line(aes(y = economics$pop)) +
geom_line(aes(y = economics$unemploy))

Figure 4.14   Line Plot Using ggplot2 Graphics

A plot like the one in Fig. 4.15 can be drawn without multiple geom_line()
calls, using the function geom_ribbon() as shown below. This function draws
not only the two lines but also the region between them.
> ggplot(economics, aes(economics$date, ymin = economics$unemploy,
    ymax = economics$pop)) + geom_ribbon(color = "blue", fill = "white")

Figure 4.15 Line Plot Using geom_ribbon()



4.6. Histograms
A histogram represents the frequencies of a variable’s values, grouped into ranges. It
is similar to a bar chart, but a histogram groups the values into continuous ranges. In
base graphics, histograms are drawn using the function hist(), as in Fig. 4.16,
which takes a vector of numbers as input together with a few more parameters, listed below.
hist(v, main, xlab, xlim, ylim, breaks, col, border)
v – numeric vector
main – title of the chart
col – colour used to fill the bars
border – border colour
xlab – label of x-axis
xlim – range of x-axis
ylim – range of y-axis
breaks – suggested number of bars, or a vector of break points between the bars

Figure 4.16 Histogram Using Base Graphics


> x <- c(45, 33, 31, 23, 58, 47, 39, 58, 28, 55, 42, 27)
> hist(x, xlab = "Age", col = "blue", border = "red", xlim = c(25, 60),
    ylim = c(0, 3), breaks = 5)
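
To see how hist() interprets the breaks argument, the bins can be inspected without drawing anything. The sketch below assumes the same vector x as above and uses plot = FALSE, which makes hist() return the computed bins instead of plotting them. R treats a single number passed to breaks as a suggestion only, so the number of bars may differ from the value given.

```r
# Same ages as in the histogram example above
x <- c(45, 33, 31, 23, 58, 47, 39, 58, 28, 55, 42, 27)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(x, breaks = 5, plot = FALSE)

h$breaks   # bin boundaries chosen by R
h$counts   # number of values falling in each bin
```

Passing a vector such as breaks = seq(20, 60, by = 10) instead of a single number fixes the bin boundaries exactly.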

The lattice histogram is drawn using the function histogram() as in Fig. 4.17 and
it behaves in the same way as the base ones. But it allows easy splitting of data into
panels and saving plots as variables. The breaks argument behaves the same way as with
hist(). The lattice histograms support counts, probability densities, and percentage
y-axes via the type argument, which takes the string “count”, “density”, or “percent”.

> histogram(~ mtcars$mpg, mtcars, breaks = 10)

Figure 4.17 Histogram Using Lattice Graphics

The ggplot histograms are created by adding the function geom_histogram() to
the ggplot() function, as in Fig. 4.18. Bin specification is simple here: we just need
to pass a numeric bin width to the geom_histogram() function. It is possible to choose
between counts and densities by passing the special names ..count.. or ..density.. to
the y-aesthetic.
> ggplot(mtcars, aes(mtcars$mpg, ..density..)) + geom_histogram(binwidth = 5)

Figure 4.18 Histogram Using ggplot2 Graphics



4.7. Box Plots


A box plot summarises the distribution of the data using five values: the
minimum, first quartile, median, third quartile and maximum. In base graphics
the box plot is created using the boxplot() function, as in Fig. 4.19. Its
parameters supply the data (as a vector, or as a formula together with a data frame),
a logical value to draw a notch, a logical value to draw each box with a width that
depends on the sample size, the title of the chart and the labels for the boxes. The
basic syntax for creating a box plot is given below, followed by an explanation of the
parameters.
boxplot(x, data, notch, varwidth, names, main)

x – vector or a formula
data – data frame
notch – logical value (TRUE – draw a notch)
varwidth – logical value (TRUE – box width proportional to the square root of the sample size)
names – labels printed under the boxes
main – title of the chart

Figure 4.19 Box Plots Using Base Graphics



This type of plot is often clearer if we reorder the box plots from smallest to
largest, in some sense. The reorder() function changes the order of a factor’s levels,
based upon some numeric score.
> boxplot(mpg ~ reorder(gear, mpg, median), data = mtcars,
    xlab = "Number of Gears", ylab = "Miles Per Gallon",
    main = "Car Mileage", varwidth = TRUE,
    col = c("red", "blue", "green"), names = c("Low", "Medium", "High"))
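
The effect of reorder() is easier to follow by looking at the factor levels directly. The short sketch below, which uses only base R and the built-in mtcars data, compares the default level order of gear with the order obtained after sorting by median mpg.

```r
# Median mpg for each number of gears
tapply(mtcars$mpg, mtcars$gear, median)

# Default level order versus the order after reorder()
g <- factor(mtcars$gear)
levels(g)
g_sorted <- reorder(g, mtcars$mpg, median)
levels(g_sorted)   # levels now sorted by increasing median mpg
```

The boxes in the plot are drawn in the order of levels(g_sorted), which is why labels such as "Low", "Medium" and "High" line up with increasing mileage rather than with the number of gears.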

In the lattice graphics the box plot is drawn using the function bwplot() as in
Fig. 4.20.
> bwplot(mpg ~ reorder(gear, mpg, median), data = mtcars,
    xlab = "Number of Gears", ylab = "Miles Per Gallon",
    main = "Car Mileage", varwidth = TRUE,
    col = c("red", "blue", "green"), names = c("Low", "Medium", "High"))

Figure 4.20 Box Plots Using Lattice Graphics

In the ggplot2 graphics the box plot is drawn by adding the function
geom_boxplot() to the function ggplot() as in Fig. 4.21.
> ggplot(mtcars, aes(reorder(gear, mpg, median), mpg)) + geom_boxplot()

Figure 4.21 Box Plots Using ggplot2 Graphics

4.8. Bar Plots


Bar charts are the natural way of displaying numeric variables split by a categorical
variable. In base graphics the bar chart is created using the barplot() function,
as in Fig. 4.22, which takes a matrix or a vector of numeric values. The additional
parameters are used to give labels to the x-axis and y-axis, the title of the chart, the
labels for the bars and the colours. The basic syntax for creating a bar chart is given
below, followed by an explanation of the parameters.
barplot(H, xlab, ylab, main, names.arg, col)
H – numeric vector or matrix
xlab – label of x-axis
ylab – label of y-axis
main – title of the chart
names.arg – vector of labels under each bar
col – colour palette
> x <- matrix(c(1000, 900, 1500, 4400, 800, 2100, 1700, 2900, 3800),
    nrow = 3, ncol = 3)
> years <- c("2011", "2012", "2013")
> city <- c("Chennai", "Mumbai", "Kolkata")
> color <- c("red", "blue", "green")
> barplot(x, main = "Yearly Sales", names.arg = years, xlab = "Year",
    ylab = "Sales", col = color)
> legend("topleft", city, cex = 0.8, fill = color)

Figure 4.22 Vertical Bar Plot Using Base Graphics
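
Because barplot() draws one stacked bar per column of a matrix, it helps to check how the sales matrix is laid out. The sketch below rebuilds the same matrix with row and column names; the dimnames argument is added here only for illustration and is not part of the original example.

```r
city  <- c("Chennai", "Mumbai", "Kolkata")
years <- c("2011", "2012", "2013")

# matrix() fills column by column: each column holds one year's sales
x <- matrix(c(1000, 900, 1500, 4400, 800, 2100, 1700, 2900, 3800),
            nrow = 3, ncol = 3, dimnames = list(city, years))
x

# Total height of each stacked bar (sales summed over the three cities)
colSums(x)
```

Each row (city) becomes one coloured segment within every bar. Passing beside = TRUE to barplot() would draw the cities side by side instead of stacking them.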

By default the bars are vertical, but if we want horizontal bars, they can be
generated with the horiz = TRUE parameter, as in Fig. 4.23. We can also do some
fiddling with the plot parameters via the par() function. The las parameter controls
whether labels are horizontal, vertical, parallel, or perpendicular to the axes. Plots
are usually more readable if you set las = 1, for horizontal. The mar parameter is a
numeric vector of length 4, giving the width of the plot margins at the
bottom/left/top/right of the plot.
Figure 4.23 Horizontal Bar Plot Using Base Graphics

> x <- matrix(c(1000, 900, 1500, 4400, 800, 2100, 1700, 2900, 3800), nrow = 3, ncol = 3)
> years <- c("2011", "2012", "2013")
> city <- c("Chennai", "Mumbai", "Kolkata")
> color <- c("red", "blue", "green")
> par(las = 1, mar = c(3, 9, 1, 1))
> barplot(x, main = "Yearly Sales", names.arg = years,
    xlab = "Sales", ylab = "Year", col = color, horiz = TRUE)
> legend("bottomright", city, cex = 0.8, fill = color)

The lattice equivalent of the function barplot() is the function barchart(), as
shown in Fig. 4.24. The formula interface is the same as the one we saw with scatter
plots, yvar ~ xvar.
> barchart(mtcars$mpg ~ mtcars$disp, mtcars)

Figure 4.24 Horizontal Bar Plot Using Lattice Graphics

Extending this to multiple variables just requires a tweak to the formula, and
passing stack = TRUE to make a stacked plot as in Fig. 4.25.
> barchart(mtcars$mpg ~ mtcars$disp + mtcars$qsec + mtcars$hp, mtcars,
    stack = TRUE)

Figure 4.25 Horizontal Stacked Bar Plot Using Lattice Graphics



In the ggplot2 graphics the bar chart is drawn by adding the function
geom_bar() to the function ggplot(), as in Fig. 4.26. Like base, ggplot2 defaults to vertical
bars; adding the function coord_flip() swaps this. Because the bar lengths are taken
from a variable in the data rather than from counts, we must pass the argument
stat = “identity” to the function geom_bar().
> ggplot(mtcars, aes(mtcars$mpg, mtcars$disp)) + geom_bar(stat = "identity") +
    coord_flip()

Figure 4.26 Horizontal Bar Plot Using ggplot2 Graphics

4.9. Other Graphical Packages

Package Name    Description

vcd             Plots for visualizing categorical data, such as mosaic plots and association plots
plotrix         Loads of extra plot types
latticeExtra    Extends the lattice package
GGally          Extends the ggplot2 package
grid            Provides access to the underlying framework of the lattice and ggplot2 packages
gridSVG         Writes grid-based plots to SVG files
playwith        Allows pointing and clicking to interact with base or lattice plots
iplots          Provides a whole extra system of plots with more interactivity
googleVis       Provides an R wrapper around Google Chart Tools, creating plots that can be displayed in a browser
rggobi          Provides an interface to GGobi (for visualizing high-dimensional data)
GGobi           Open source visualization program for exploring high-dimensional data
rgl             Provides an interface to OpenGL for interactive 3D plots
animation       Lets you make animated GIFs or SWF animations
rCharts         Provides wrappers to half a dozen JavaScript plotting libraries using lattice syntax

HIGHLIGHTS
• Exploratory Data Analysis (EDA) shows how to use visualisation and
  transformation to explore data in a systematic way.
• The main graphical packages are base, lattice and ggplot2.
• In R the pie chart is created using the pie() function.
• A 3D pie chart can be drawn using the package plotrix, which provides the
  function pie3D().
• The basic scatter plot in the base graphics system can be obtained by
  using the plot() function.
• We use the arguments col and pch (values between 1 and 25) in the plot()
  function to specify colour and plot pattern.
• The layout() function is used to control the layout of multiple plots in the
  matrix.
• We use the pairs() function to create matrices of scatter plots.
• The lattice graphics system has an equivalent of the plot() function, and it is
  xyplot().
• In the ggplot2 graphics, each plot is drawn with a call to the ggplot()
  function.
• The ggplots can also be split into several panels like the lattice plots, and
  this is done using the function facet_wrap().
• The plot() function in R is used to create the line graph in base graphics,
  and the argument type = “l” is used to specify that it is a line.
• Line plots in the lattice graphics use the xyplot() function, and the
  argument type = “l” is used to specify that it is a line.
• The ggplot2 line plots are created by adding the function geom_line() to
  the ggplot() function.
• In R histograms in the base graphics are drawn using the function hist().
• The lattice histogram is drawn using the function histogram().
• The ggplot2 histograms are created by adding the function
  geom_histogram() to the ggplot() function.
• In R base graphics the box plot is created using the boxplot() function.
• In the lattice graphics the box plot is drawn using the function bwplot().
• In the ggplot2 graphics the box plot is drawn by adding the function
  geom_boxplot() to the function ggplot().
• In R base graphics the bar chart is created using the barplot() function.
• The lattice equivalent of the function barplot() is the function barchart().
• In the ggplot2 graphics the bar chart is drawn by adding the function
  geom_bar() to the function ggplot().
CHAPTER 5

Statistical Analysis Using R

OBJECTIVES

On completion of this Chapter you will be able to:

• obtain the basic statistical measures like mean, median, mode, standard
  deviation, variance etc., using R
• obtain the summary statistics of a given data
• understand and plot the normal distribution of data using R functions
• understand and plot the binomial distribution of data using R functions
• perform correlation analysis on the given data using R
• perform regression analysis on the given data using R
• perform ANOVA, ANCOVA on the given data using R
• perform chi-square and hypothesis testing on the given data using R

Statistical analysis in R is performed by using many in-built functions. Most of
these functions are part of the R base package. These functions take an R vector as
input, along with other arguments, and give the result. The other important R package
for statistical analysis is the stats package.

5.1. Basic Statistical Measures


Any dataset that is available in R or has been imported into R for further analysis may
have both categorical data and numeric data. We can apply the statistical
functions available in R to the numeric data to understand the statistical
measures of the fields. The basic statistical measures are the minimum, maximum,
mean and median, represented by the functions min(), max(), mean() and median()
respectively. Let us use the dataset named mtcars, which is available in R by default, to
understand these statistical measures.
> data(mtcars)
> colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> min(mtcars$cyl)
[1] 4
> max(mtcars$cyl)
[1] 8
> mean(mtcars$cyl)
[1] 6.1875
> median(mtcars$cyl)
[1] 6

All of the above results can also be obtained with the single function summary(), which
can also be applied to all the fields of a dataset at once. The range() function
gives the minimum and maximum values of a numeric field together.
> summary(mtcars$cyl)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.000 4.000 6.000 6.188 8.000 8.000
> range(mtcars$cyl)
[1] 4 8

5.1.1. Mean

Mean is calculated by taking the sum of the values and dividing with the number of
values in a data series. The function mean() is used to calculate this in R. The basic
syntax for calculating mean in R is given below along with its parameters.
mean(x, trim = 0, na.rm = FALSE, ...)
x - numeric vector
trim - fraction (0 to 0.5) of observations to drop from each end of the sorted vector
na.rm - logical value; TRUE removes the missing values from the input vector

> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> mean(x)
[1] 9.8

When the trim parameter is supplied, the values in the vector are sorted and
the given fraction of observations is dropped from each end before the mean is calculated.
With trim = 0.2 and 10 values, 2 values are dropped from each end.
In this case the sorted vector is (-91, -45, 1, 3, 12, 15, 24, 45, 56, 78) and
the values removed from the vector for calculating the mean are (-91, -45) from the left
and (56, 78) from the right.
> mean(x, trim = 0.2)
[1] 16.66667
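
The trimming step can be reproduced by hand to confirm what mean() does with the trim argument. The sketch below assumes the same vector x as above; the names k and kept are introduced only for this illustration.

```r
x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
trim <- 0.2

# Number of observations dropped from each end
k <- floor(length(x) * trim)

# Sort, drop k values from each end, then average what remains
kept <- sort(x)[(k + 1):(length(x) - k)]
kept
mean(kept)
```

The result agrees with mean(x, trim = 0.2).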

If there are missing values, then the mean() function returns NA. To drop the
missing values from the calculation use na.rm = TRUE, which means remove the
NA values.
> x <- c(45, 56, 78, 12, 3, -91, NA, -45, 15, 1, 24, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 9.8

5.1.2. Median

The middle value of a sorted data series is called the median. When the number of values
is even, the median is the average of the two middle values. The median() function
is used in R to calculate this value. The basic syntax for calculating the median in R is
given below along with its parameters.
median(x, na.rm = FALSE)
x - numeric vector
na.rm - to remove the missing values from the input vector

> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> median(x)
[1] 13.5

5.1.3. Mode

The mode is the value that has the highest number of occurrences in a set of data.
Unlike the mean and median, the mode can be found for both numeric and character data. R does
not have a standard in-built function to calculate the mode, so we create a user function
to calculate the mode of a data set in R. This function takes a vector as input and
gives the mode value as output.
Mode <- function(x)
{
y <- unique(x)
y[which.max(tabulate(match(x, y)))]
}

> x <- c(1,2,3,4,5,5,5)
> Mode(x)
[1] 5
> ch <- c("a", "e", "i", "o", "u", "u", "a", "a")
> Mode(ch)
[1] "a"

The function unique() returns a vector, data frame or array like x but with
duplicate elements/rows removed. The function match() returns a vector of
the positions of (first) matches of its first argument in its second. The function
tabulate() takes the integer-valued vector bin and counts the number of times each
integer occurs in it. The function which.max() determines the location, i.e., index
of the (first) maximum of a numeric (or logical) vector.
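
The four steps inside Mode() can be run one at a time to see what each function contributes. The sketch below assumes the numeric vector from the example above; the intermediate names pos and counts are introduced only for illustration.

```r
x <- c(1, 2, 3, 4, 5, 5, 5)

y <- unique(x)              # distinct values: 1 2 3 4 5
pos <- match(x, y)          # position of each element of x within y
counts <- tabulate(pos)     # how often each distinct value occurs
y[which.max(counts)]        # distinct value with the highest count
```

Since which.max() returns only the first maximum, this function reports a single mode even when several values are tied for the highest count.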

5.1.4. Standard Deviation and Variance

The functions to calculate the standard deviation, variance and the median absolute
deviation are sd(), var() and mad() respectively.
> sd(mtcars$cyl)
[1] 1.785922
> var(mtcars$cyl)
[1] 3.189516
> mad(mtcars$cyl)
[1] 2.9652
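
These three measures are related to one another and to the median, which the following base R sketch makes explicit. The constant 1.4826 is the default scaling factor that mad() applies to the raw median absolute deviation.

```r
x <- mtcars$cyl

# Standard deviation is the square root of the variance
sqrt(var(x))

# mad() scales the median absolute deviation by 1.4826 by default
1.4826 * median(abs(x - median(x)))
```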

5.1.5. Quartile Ranges

The quantile() function provides the quartiles of the numeric values. An alternative
function is fivenum(), which returns Tukey’s five-number summary. The IQR() function
provides the interquartile range of the numeric fields.
> quantile(mtcars$cyl)
0% 25% 50% 75% 100%
4 4 6 8 8
> fivenum(mtcars$cyl)
[1] 4 4 6 8 8
> IQR(mtcars$cyl)
[1] 4
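
The following sketch shows how IQR() relates to quantile(), and why quantile() and fivenum() do not always agree. The vector v is a small made-up example chosen so that the two functions give different quartiles.

```r
x <- mtcars$cyl

# IQR is the distance between the third and first quartiles
q <- quantile(x, c(0.25, 0.75))
unname(q[2] - q[1])

# fivenum() uses Tukey's hinges, which can differ from quantile()
v <- 1:10
quantile(v)
fivenum(v)
```

For mtcars$cyl both functions give the same five values, as in the output above, but for v the quartiles from quantile() are interpolated while the hinges from fivenum() are not.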

5.1.6. Other Statistical Functions

The functions cor() and cov() are used to find the correlation and covariance between
two numeric fields respectively. In the example below the values show that there is
a negative correlation between the two numeric fields.
> cor(mtcars$mpg, mtcars$cyl)
[1] -0.852162
> cov(mtcars$mpg, mtcars$cyl)
[1] -9.172379
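
The two measures are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. A minimal check with the same two columns:

```r
x <- mtcars$mpg
y <- mtcars$cyl

# Correlation is the covariance rescaled by both standard deviations
cov(x, y) / (sd(x) * sd(y))
cor(x, y)
```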

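There are other statistics functions such as pmin(), pmax() [parallel equivalents
of min() and max() respectively], cummin() [cumulative minimum value], cummax()
[cumulative maximum value], cumsum() [cumulative sum] and cumprod()
[cumulative product]. When given a single vector, as below, pmin() and pmax() simply
return that vector; they are meant for comparing two or more vectors element by element.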
> nrow(mtcars)
[1] 32
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmin(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmax(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> cummin(mtcars$cyl)
[1] 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
> cummax(mtcars$cyl)
[1] 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
> cumsum(mtcars$cyl)
[1] 6 12 16 22 30 36 44 48 52 58 64 72 80 88 96 104 112 116 120 124
[21] 128 136 144 152 160 164 168 172 180 186 194 198
> cumprod(mtcars$cyl)
[1] 6.000000e+00 3.600000e+01 1.440000e+02 8.640000e+02 6.912000e+03 4.147200e+04
[7] 3.317760e+05 1.327104e+06 5.308416e+06 3.185050e+07 1.911030e+08 1.528824e+09
[13] 1.223059e+10 9.784472e+10 7.827578e+11 6.262062e+12 5.009650e+13 2.003860e+14
[19] 8.015440e+14 3.206176e+15 1.282470e+16 1.025976e+17 8.207810e+17 6.566248e+18
[25] 5.252999e+19 2.101199e+20 8.404798e+20 3.361919e+21 2.689535e+22 1.613721e+23
[31] 1.290977e+24 5.163908e+24
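
The difference between min() and its parallel version pmin() shows up only when more than one vector is supplied. The sketch below uses two short made-up vectors, a and b, for illustration.

```r
a <- c(1, 5, 3)
b <- c(4, 2, 6)

min(a, b)    # single smallest value across both vectors
pmin(a, b)   # element-wise minimum
pmax(a, b)   # element-wise maximum

cumsum(a)    # running total
cummax(a)    # running maximum
```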

5.2. Summary Statistics


The summary() function can be applied to an entire dataset to get the
summary statistics of all its numeric fields.
> summary(mtcars)
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930

wt qsec vs am gear
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000

carb
Min. :1.000
1st Qu.:2.000
Median :2.000

Mean :2.812
3rd Qu.:4.000
Max. :8.000

5.3. Normal Distribution


In a random collection of data from independent sources, it is generally observed
that the distribution of the data is normal. This means that, on plotting a graph with the
value of the variable on the horizontal axis and the count of the values on the vertical
axis, we get a bell-shaped curve. The centre of the curve represents the mean of the
data set. In the graph, half of the values lie to the left of the mean and the other half lie
to the right of it. This is referred to as the normal distribution in statistics. R has
four in-built functions for the normal distribution. They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

x - vector of numbers
p - vector of probabilities
n - sample size
mean - mean (default value is 0)
sd - standard deviation (default value is 1)

5.3.1. dnorm()

For a given mean and standard deviation, this function gives the height of the
probability density curve at each point. Below is an example in which the result of the dnorm()
function is plotted in a graph in Fig. 5.1.
> x <- seq(-5,5, by = 0.05)
> y <- dnorm(x, mean = 1.5, sd = 0.5)
> plot(x, y)

Figure 5.1 Plot of dnorm()
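
The height returned by dnorm() is the value of the normal density formula, which can be written out directly. The sketch below assumes the same grid, mean and standard deviation as the example above; the names mu, sigma and manual are introduced only for this illustration.

```r
x <- seq(-5, 5, by = 0.05)
mu <- 1.5
sigma <- 0.5

# Normal density written out from its formula
manual <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# Agrees with dnorm() up to floating-point error
all.equal(manual, dnorm(x, mean = mu, sd = sigma))

# The curve peaks at the mean
x[which.max(manual)]
```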

5.3.2. pnorm()

The pnorm() function returns the probability that a normally distributed random
number is less than a given value. This is also known as the
“Cumulative Distribution Function”. Below is an example in which the result of the
pnorm() function is plotted in a graph as in Fig. 5.2.
> x <- seq(-5,5, by = 0.05)
> y <- pnorm(x, mean = 1.5, sd = 1)
> plot(x, y)

Figure 5.2 Plot of pnorm()



5.3.3. qnorm()

The qnorm() function takes a probability value as input and returns the value
whose cumulative probability matches it. Below is an example in which the result of
the qnorm() function is plotted in a graph as in Fig. 5.3.
> x <- seq(0, 1, by = 0.02)
> y <- qnorm(x, mean = 2, sd = 1)
> plot(x, y)

Figure 5.3 Plot of qnorm()
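
Since qnorm() is the inverse of pnorm(), applying one after the other returns the starting value. The short sketch below illustrates this with the standard normal distribution (mean 0, standard deviation 1); the value 1.96 is chosen only as a familiar example.

```r
# Probability that a standard normal value is below 1.96
p <- pnorm(1.96)
p

# qnorm() maps that probability back to the original value
qnorm(p)

# Probability of falling within one standard deviation of the mean
pnorm(1) - pnorm(-1)
```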

5.3.4. rnorm()

This function is used to generate random numbers whose distribution is normal. It
takes the sample size as input and generates that many random numbers. We draw
a histogram to show the distribution of the generated numbers as in Fig. 5.4.

Figure 5.4 Histogram Using rnorm()

> x <- rnorm(80)
> hist(x, main = "Normal Distribution")
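
Because rnorm() draws new random numbers on every call, the histogram looks slightly different each time the code is run. The sketch below shows one way to make the draw reproducible with set.seed(); the seed value 42 and the sample size 10000 are arbitrary choices made for this illustration.

```r
set.seed(42)    # arbitrary seed, chosen only for reproducibility
x <- rnorm(10000, mean = 0, sd = 1)

mean(x)   # close to 0
sd(x)     # close to 1

# Re-using the seed reproduces exactly the same numbers
set.seed(42)
y <- rnorm(10000, mean = 0, sd = 1)
identical(x, y)
```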

5.4. Binomial Distribution


The binomial distribution models the number of successes in a series of independent
experiments, each of which has only two possible outcomes. For example, tossing
a coin always gives a head or a tail. The binomial distribution can be used, for instance,
to find the probability of getting exactly 3 heads when a coin is tossed 10 times. R has four
in-built functions for the binomial distribution. They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

x - vector of numbers
p - vector of probabilities
n - sample size
size – number of trials
prob – probability of success of each trial
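
The coin example mentioned above, exactly 3 heads in 10 tosses, can be worked out with dbinom() and checked against the binomial formula. A fair coin (prob = 0.5) is assumed here.

```r
# P(exactly 3 heads in 10 tosses of a fair coin)
dbinom(3, size = 10, prob = 0.5)

# The same value from the binomial formula
choose(10, 3) * 0.5^3 * (1 - 0.5)^(10 - 3)

# Probabilities over all possible outcomes sum to 1
sum(dbinom(0:10, size = 10, prob = 0.5))
```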

5.4.1. dbinom()

This function gives the probability of each possible number of successes. Below is an
example in which the result of the dbinom() function is plotted in a graph as in Fig. 5.5.
> x <- seq(0, 25, by = 1)
> y <- dbinom(x, 25, 0.5)
> plot(x, y)

Figure 5.5 Plot Using dbinom()

5.4.2. pbinom()

This function gives the cumulative probability of an event. It is a single value
representing the probability. The probability of getting 25 or fewer heads from 50
tosses of a coin is given by the code below.
> x <- pbinom(25, 50, 0.5)
> x
[1] 0.5561376
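
The cumulative probability is simply the sum of the individual probabilities up to the given number of heads, which the following sketch confirms. The lower.tail argument is a standard option of pbinom() that is not used in the text above.

```r
# Cumulative probability of at most 25 heads in 50 tosses
pbinom(25, 50, 0.5)

# Equivalent sum of the individual probabilities for 0, 1, ..., 25 heads
sum(dbinom(0:25, 50, 0.5))

# Probability of more than 25 heads
pbinom(25, 50, 0.5, lower.tail = FALSE)
```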

5.4.3. qbinom()

The function qbinom() takes a probability value as input and returns the smallest number
of successes whose cumulative probability is at least that value. The example below finds
the number of heads at which the cumulative probability reaches 0.5 when a coin is tossed 50 times.
> x <- qbinom(0.5, 50, 1/2)
> x
[1] 25
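
The result can be checked against pbinom(): the cumulative probability at the returned value should be at least 0.5, while the value just below it should fall short. The name k is used here only for illustration.

```r
k <- qbinom(0.5, 50, 0.5)
k

pbinom(k, 50, 0.5)       # at least 0.5
pbinom(k - 1, 50, 0.5)   # still below 0.5
```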

5.4.4. rbinom()

The function rbinom() returns the required number of random values drawn from a
binomial distribution with the given number of trials and probability of success. The code
below generates 5 random values, each being the number of successes in 50 trials with a
probability of 0.5. The values will differ from run to run.
> x <- rbinom(5, 50, 0.5)
> x
[1] 24 21 22 29 32

5.5. Correlation Analysis


To evaluate the relationship between two or more variables, a correlation test is used.
The correlation coefficient can be computed in R using the function cor(), while cor.test()
additionally tests whether the correlation differs significantly from zero.
The basic syntax for the correlation functions in R is as below.
cor(x, y, method)

cor.test(x, y, method)

x, y - numeric vectors with the same length
method - correlation method (“pearson” or “kendall” or “spearman”)

Consider the data set “mtcars” available in the R environment. Let us first find
the correlation between the horse power (“hp”) and the miles per gallon (“mpg”)
of the cars and then between the horse power (“hp”) and the cylinder displacement
(“disp”) of the cars. From the test we find that the horse power (“hp”) and the
miles per gallon (“mpg”) of the cars have a negative correlation (-0.7761684) and
the horse power (“hp”) and the cylinder displacement (“disp”) of the cars have a
positive correlation (0.7909486).
> cor(mtcars$hp, mtcars$mpg, method = "pearson")
[1] -0.7761684

> cor.test(mtcars$hp, mtcars$mpg, method = "pearson")


Pearson’s product-moment correlation
data: mtcars$hp and mtcars$mpg
t = -6.7424, df = 30, p-value = 1.788e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8852686 -0.5860994
sample estimates:
cor
-0.7761684

> cor(mtcars$hp, mtcars$disp, method = "pearson")
[1] 0.7909486
> cor.test(mtcars$hp, mtcars$disp, method = "pearson")

Pearson’s product-moment correlation


data: mtcars$hp and mtcars$disp
t = 7.0801, df = 30, p-value = 7.143e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6106794 0.8932775
sample estimates:
cor
0.7909486
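
The object returned by cor.test() stores each part of the printed output as a named component, so individual results can be extracted for later use. The sketch below assumes the same pair of columns as the first test above and also shows the rank-based "spearman" method listed in the syntax.

```r
ct <- cor.test(mtcars$hp, mtcars$mpg, method = "pearson")

ct$estimate   # correlation coefficient, same value as cor()
ct$p.value    # p-value of the test
ct$conf.int   # 95 percent confidence interval

# Rank-based alternative that does not assume a linear relationship
cor(mtcars$hp, mtcars$mpg, method = "spearman")
```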

The correlation results can also be viewed graphically as in Fig. 5.6. The corrplot()
function from the corrplot package can be used to analyze the correlation between the
various columns of a dataset, say mtcars. After this, the correlation between individual
columns can be compared by plotting it in separate graphs as in Fig. 5.7 and Fig. 5.8.
> library(corrplot)
> M <- cor(mtcars)
> corrplot(M, method = "number")

Figure 5.6 Cor Plot of mtcars Dataset


> plot(mtcars$hp, mtcars$mpg, xlab = "Horse Power of the Cars",
    ylab = "Mileage per Gallon of the Cars", pch = 21)

Figure 5.7 Scatter Plot of Negative Correlation


> plot(mtcars$hp, mtcars$disp, xlab = "Horse Power of the Cars",
    ylab = "Cylinder Displacement of the Cars", pch = 21)

Figure 5.8 Scatter Plot of Positive Correlation

It can be noted that the graph with negative correlation (Fig. 5.7) has the dots
from top left corner to bottom right corner and the graph with positive correlation
(Fig. 5.8) has the dots from the bottom left corner to the top right corner.

5.6. Regression Analysis

5.6.1. Linear Regression

Regression analysis is a widely used statistical tool that establishes a relationship model
between two variables. One of these variables
is called the predictor variable, whose value is gathered through experiments. The
other variable is called the response variable, whose value is derived from the predictor
variable. In linear regression the two variables are related through an equation
in which the exponent of both variables is 1. A linear relationship is
represented by a straight line when plotted as a graph, whereas a non-linear relationship,
in which the exponent of any variable is not equal to 1, creates a curve.

The general mathematical equation for a linear regression is y = ax + b, where
y is the response variable, x is the predictor variable, and a and b are constants
called the coefficients.

A simple example of regression is predicting the income of a person when their
expenditure is known. To do this we need the relationship between the income
and expenditure of a person. First, carry out the experiment of gathering a sample
of observed values of income and the corresponding expenditures. Then create a
relationship model using the lm() function in R. Find the coefficients from the
model created and form the mathematical equation using these values. Next, get a
summary of the relationship model to see the errors in prediction, which
are called residuals. Finally, to predict the income of new persons, use the
predict() function in R.

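The function lm() creates the relationship model between the predictor and the
response variable. The basic syntax of the lm() function for linear regression is given
below.
lm(formula, data)

formula - relation between x and y, written as y ~ x
data - data frame containing the variables in the formula (it can be omitted when the
variables are vectors in the workspace, as in the example below)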

> x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
> y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
> model <- lm(y~x)
> model

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-3845.509 6.746

> summary(model)

Call:
lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max


-630.02 -166.29 4.12 189.44 397.75

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3845.5087 804.9013 -4.778 0.00139 **
x 6.7461 0.5191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 325.3 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06

The basic syntax of the function predict() for linear regression is given below.
predict(object, newdata)

object - model created using the lm() function
newdata - data frame containing the new values of the predictor variable
> expense <- data.frame(x = 4700)
> income <- predict(model, expense)
> income
1
27861.18
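
The prediction can be reproduced from the fitted coefficients, which makes the link to the equation y = ax + b explicit. The sketch below refits the same model; the names b and manual are introduced only for this illustration.

```r
x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
model <- lm(y ~ x)

b <- coef(model)          # b[1] is the intercept, b[2] the slope
manual <- unname(b[1] + b[2] * 4700)
manual

# predict() gives the same value
predict(model, data.frame(x = 4700))
```

Note that 4700 lies far outside the range of the observed x values (1280 to 1860), so this prediction is an extrapolation and should be treated with caution.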

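The graphical representation of this linear regression is drawn by the code below,
as shown in Fig. 5.9. Because the plot places income on the horizontal axis, the line
added with abline() comes from regressing x on y.
> plot(y, x, col = "blue", main = "Income & Expenditure Regression",
    cex = 1.3, pch = 16,
    xlab = "Income in Rs.", ylab = "Expenditure in Rs.")
> abline(lm(x ~ y))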

Figure 5.9 Plot of Linear Regression

5.6.2. Multiple Regressions

Multiple regression is an extension of linear regression to the relationship between
more than two variables. In simple linear regression we have one predictor and one
response variable, but in multiple regression we have more than one predictor
variable and one response variable.

The equation for multiple regression is given below.

y = a + b1·x1 + b2·x2 + ... + bn·xn

In this equation y is the response variable, a, b1, b2, ..., bn are the coefficients and
x1, x2, ..., xn are the predictor variables.

In R, the lm() function is used to create the regression model. The model
determines the values of the coefficients using the input data. We can then predict
the value of the response variable for a given set of predictor variables using these
coefficients. The relationship model is built between the predictor variables and
the response variable. The basic syntax of the lm() function for multiple regression is
given below.
lm(y ~ x1+x2+x3..., data)
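
As a minimal sketch of this syntax, the example below fits a multiple regression on the built-in mtcars data. The choice of mpg as the response and disp, hp and wt as predictors, and the values in new_car, are assumptions made only for illustration.

```r
# Response mpg modelled from three predictors in the built-in mtcars data
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
coef(fit)    # intercept a followed by b1, b2, b3

# Predict mpg for a hypothetical car
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
predict(fit, new_car)
```

The names of the columns in new_car must match the predictor names used in the formula.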

We can conclude that the value of b1 is closer to 1 (1.253135), while the
value of b2 is closer to 2 (2.496484) and not 3.

5.7. Analysis of Variance (ANOVA)


Analysis of Variance (ANOVA) is a statistical technique used for investigating
data by comparing the means of subsets of the data. The base case is the one-way
ANOVA, in which the data is sub-divided into groups based on a single
classification factor and the test verifies whether the means of the
groups are equal. This analysis may not be sufficient for more complex problems.
For example, it may be necessary to take two factors of variability into account to
determine whether the averages between the groups depend on the group classification or
on a second variable. In this case the two-way analysis of variance
(two-way ANOVA) should be used.

Consider the dataset PlantGrowth, available in R, for performing one-way ANOVA.
This dataset has two columns: the group (control or treatment) and the weight of
the plant, which indicates its growth. We want to test the hypothesis that the
control group / treatment has an effect on the plant weight / plant growth. The
code below does this.
> plant = lm(PlantGrowth$weight ~ PlantGrowth$group)
> anova(plant)
Analysis of Variance Table

Response: PlantGrowth$weight
                  Df  Sum Sq Mean Sq F value  Pr(>F)
PlantGrowth$group  2  3.7663  1.8832  4.8461 0.01591 *
Residuals         27 10.4921  0.3886
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The result shows that the F-value is 4.8461 and the p-value is 0.01591, which is less
than 0.05 (5% level of significance). The null hypothesis is therefore rejected; that
is, the control group / treatment has an effect on the plant growth / plant weight.
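To see where the F-value in the table comes from, the following sketch recomputes it by hand from the between-group and within-group sums of squares. The variable names are illustrative; only the PlantGrowth dataset and the reported values are taken from the example above.

```r
# One-way ANOVA for PlantGrowth, computed step by step.
weight <- PlantGrowth$weight
group  <- PlantGrowth$group

grand_mean  <- mean(weight)
group_means <- tapply(weight, group, mean)    # mean weight per group
group_sizes <- tapply(weight, group, length)  # observations per group

# Between-group sum of squares: spread of group means around the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group (residual) sum of squares: spread around each group's own mean.
ss_within <- sum((weight - ave(weight, group))^2)

df_between <- nlevels(group) - 1
df_within  <- length(weight) - nlevels(group)

# F is the ratio of the two mean squares; the p-value is the upper tail of F.
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```

The two mean squares correspond to the Mean Sq column of the table, and their ratio is the reported F value.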

For two-way ANOVA, consider the example below of monthly revenues collected
over 5 years. We want to see whether the revenue depends on the year and / or the
month, or whether it is independent of these two factors.
> revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21,
+ 23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18,
+ 19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11,
+ 21,23,11,18,14)

> months = gl(12,5)


> years = gl(5, 1, length(revenue))
> fit = aov(revenue ~ months + years)

> anova(fit)
Analysis of Variance Table

Response: revenue
          Df Sum Sq Mean Sq F value Pr(>F)
months    11 308.45  28.041  1.4998 0.1660
years      4  44.17  11.042  0.5906 0.6712
Residuals 44 822.63  18.696
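The two factors in this example are built with gl(). As a small-scale sketch of what those calls produce, the code below uses 3 "months" and 2 "years" instead of 12 and 5; the names months_demo, years_demo and design are illustrative only.

```r
# gl(n, k) creates a factor with n levels, each repeated k times in a block.
months_demo <- gl(3, 2)
print(months_demo)   # levels run 1 1 2 2 3 3

# gl(n, 1, length) cycles through the n levels until `length` values exist.
years_demo <- gl(2, 1, 6)
print(years_demo)    # levels cycle 1 2 1 2 1 2

# Cross-tabulating the two factors shows that every (month, year)
# combination occurs exactly once, as in the revenue example.
design <- table(months_demo, years_demo)
print(design)
```

In the revenue data the same pattern means that each block of five consecutive values belongs to one month, with one value per year.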

The significance of the difference between months is F = 1.4998. This value is
lower than the tabulated critical value, and indeed the p-value is greater than
0.05. So we cannot reject the null hypothesis: the data give no evidence that the
mean revenue differs between months, and we conclude that the variable “months”
has no significant effect on revenue.

The significance of the difference between years is F = 0.5906. This value is also
lower than the tabulated critical value, and the p-value is greater than 0.05. So we
fail to reject the null hypothesis: the data give no evidence that the mean revenue
differs between years, and we conclude that the variable “years” has no significant
effect on revenue.
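The comparison with the tabulated value can also be done in R rather than with a printed F table. The sketch below takes the F values and degrees of freedom from the ANOVA table above and assumes a 5% significance level; the variable names are illustrative.

```r
alpha <- 0.05

# F values and degrees of freedom as reported in the ANOVA table.
f_months <- 1.4998; df_months <- 11
f_years  <- 0.5906; df_years  <- 4
df_resid <- 44

# Critical values: the (1 - alpha) quantile of the F distribution.
crit_months <- qf(1 - alpha, df_months, df_resid)
crit_years  <- qf(1 - alpha, df_years, df_resid)

# p-values: upper-tail probability of the observed F value.
p_months <- pf(f_months, df_months, df_resid, lower.tail = FALSE)
p_years  <- pf(f_years, df_years, df_resid, lower.tail = FALSE)

# The null hypothesis is rejected only if F exceeds the critical value.
reject_months <- f_months > crit_months
reject_years  <- f_years > crit_years

print(c(crit_months = crit_months, crit_years = crit_years))
print(c(reject_months = reject_months, reject_years = reject_years))
```

Comparing F with the critical value and comparing the p-value with alpha are equivalent decisions, so both checks lead to the same conclusion for each factor.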
