EM622 Data Analysis and Visualization
Techniques for Decision-Making
Introduction to R and Data Manipulation
1 / 47
Getting Started
RStudio console
[Figure: RStudio window with labelled panes: Options (import dataset), File viewer (data and code), Console (for typing commands), and Plots.]
Your first graph
Copy and paste:
data(iris)
plot(Sepal.Width ~ Sepal.Length, data=iris,
col=c("red","orange","blue")[iris$Species],pch=16,
xlab="Sepal Length", ylab="Sepal Width")
legend("topright", legend=levels(iris$Species),
col=c("red","orange","blue"), bty="n",pch=16)
Agenda
1. Basic operations
2. Data structures
3. Data Manipulation
4. Your First Graph
Basic Operation - Import data
1. Import data from the drop-down menu in RStudio:
2. Import data from SAS, SPSS, and other formats: http://www.statmethods.net/input/importingdata.html
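The menu-driven import also has a command-line equivalent. As a minimal sketch, assuming a comma-separated file with a header row (the file and its contents below are made up for illustration), base R's read.csv() reads it into a data frame:

```r
# Create a small CSV file in a temporary location (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("team,first,second",
             "A,92,70",
             "B,89,73"), csv_path)

# read.csv() returns a data frame; the first line supplies the column names
scores <- read.csv(csv_path, stringsAsFactors = FALSE)
str(scores)
```

For a real file, replace csv_path with the path to the file on disk.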
Intermediate - Import data
## install.packages(c("tseries","lubridate"))
library(tseries)
library(lubridate)
amazon <- as.data.frame(get.hist.quote("amzn",
start="2013-1-1", end="2018-9-15", quote=c("Cl")))
## time series starts 2013-01-02
## time series ends 2018-09-14
amazon$Date<-ymd(row.names(amazon))
tail(amazon)
## Close Date
## 2018-09-07 1952.07 2018-09-07
## 2018-09-10 1939.01 2018-09-10
## 2018-09-11 1987.15 2018-09-11
## 2018-09-12 1990.00 2018-09-12
## 2018-09-13 1989.87 2018-09-13
## 2018-09-14 1970.19 2018-09-14
Advanced - Import data
# list of addresses for raw data.
addressList <- list(
drives_address = "http://stats.nba.com/js/data/sportvu/drivesData.js",
defense_address = "http://stats.nba.com/js/data/sportvu/defenseData.js",
catchshoot_address = "http://stats.nba.com/js/data/sportvu/catchShootData.js")
# function that grabs the data from the website and converts to R data frame
readIt <- function(address) {
web_page <- readLines(address)
## regex to strip javascript bits and convert raw to csv format
x1 <- gsub("[\\{\\}\\]]", "", web_page, perl = TRUE)
x2 <- gsub("[\\[]", "\n", x1, perl = TRUE)
x3 <- gsub("\"rowSet\":\n", "", x2, perl = TRUE)
x4 <- gsub(";", ",", x3, perl = TRUE)
# read the resulting csv with read.table()
nba <- read.table(textConnection(x4), header = TRUE,
sep = ",", skip = 2, stringsAsFactors = FALSE)
return(nba)
}
# download the data
df_list <- lapply(addressList, readIt)
Advanced (Cont.) - Import data
# check the data
catchshoot<-df_list$catchshoot_address
#str(catchshoot) # Get information about structure
head(catchshoot)
## PLAYER_ID PLAYER FIRST_NAME LAST_NAME TEAM_ABBREVIATION GP MIN
## 1 202691 Klay Thompson Klay Thompson GSW 78 34.0
## 2 1717 Dirk Nowitzki Dirk Nowitzki DAL 53 26.3
## 3 2594 Kyle Korver Kyle Korver CLE 35 24.6
## 4 201586 Serge Ibaka Serge Ibaka TOR 23 30.9
## 5 201567 Kevin Love Kevin Love CLE 60 31.4
## 6 202331 Paul George Paul George IND 74 35.8
## PTS FGM FGA FG_PCT FG3M FG3A FG3_PCT EFG_PCT PTS_TOT X
## 1 11.5 4.2 9.3 0.454 3.1 7.1 0.438 0.621 899 NA
## 2 8.1 3.4 7.5 0.446 1.3 3.5 0.388 0.535 427 NA
## 3 7.6 2.7 5.7 0.470 2.2 4.7 0.470 0.662 265 NA
## 4 7.5 2.9 6.9 0.424 1.7 4.3 0.394 0.547 173 NA
## 5 7.5 2.6 6.6 0.388 2.3 5.8 0.395 0.561 448 NA
## 6 7.4 2.7 6.1 0.437 2.0 4.8 0.420 0.603 546 NA
Advanced: scraping the web using R
#install.packages("rvest")
library(rvest)
# Store web url
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
#Scrape the website for the movie rating
rating <- lego_movie %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
#rating
# Scrape the website for the cast
cast <- lego_movie %>%
html_nodes("#titleCast .itemprop span") %>%
html_text()
#cast
https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
Advanced (Cont.): scraping the web using R
#Scrape the website for the movie rating
rating
## [1] 7.8
# Scrape the website for the cast
cast
## character(0)
Basic Operation - Export data
• To export a data frame to a spreadsheet, the easiest way is to use write.csv().
• By default, write.csv() includes row names, but these are usually unnecessary and may cause confusion; pass row.names = FALSE to omit them.
• The exported file is stored in the working directory.
# export 'mydf' as a .csv file, omitting row names:
write.csv(mydf, "test.csv", row.names = FALSE)
• How to find out your working directory?
# returns an absolute filepath representing the current working directory o
getwd()
## [1] "/Users/annieyu/Dropbox/622 visualization/lectures/Lecture 3_intro_t
• Write data to files in other formats:
http://www.cookbook-r.com/Data_input_and_output/Writing_data_to_a_file/
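To see what the export step produces, here is a small round-trip sketch. The data frame out_df and the temporary path are hypothetical; only write.csv(), read.csv(), and readLines() from base R are used:

```r
# Toy data frame (hypothetical values)
out_df <- data.frame(team = c("A", "B"), first = c(92, 89))

# Write to a temporary file; row.names = FALSE omits the row-name column
out_path <- file.path(tempdir(), "test.csv")
write.csv(out_df, out_path, row.names = FALSE)

# Inspect the raw text of the file, then read it back into a data frame
readLines(out_path)
back <- read.csv(out_path)
```

Writing to tempdir() keeps the example from cluttering the working directory; in practice a path relative to getwd() works the same way.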
Basic Operation - Install packages
Two ways to install a package:
1. From the drop-down menu in RStudio:
2. Using a command:
# Download and install packages from CRAN-like repositories or from local f
install.packages(c("ggplot2","tidyr","dplyr"))
# Always load a package before calling its functions:
library(ggplot2)
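Re-running install.packages() every session is unnecessary. One possible pattern, sketched here with a hypothetical helper named missing_packages(), is to check which packages are absent and install only those:

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# 'stats' and 'utils' ship with R, so nothing should be reported as missing
to_install <- missing_packages(c("stats", "utils"))
to_install

# With real package names, install only what is missing:
# to_install <- missing_packages(c("ggplot2", "tidyr", "dplyr"))
# if (length(to_install) > 0) install.packages(to_install)
```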
Basic Operation - Update packages
1. To update all your installed packages to the latest versions available:
update.packages()
2. To store your R code, always create an R script:
3. Export your images to pdf/png format:
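Besides the export button in RStudio, a plot can be saved from code. A minimal sketch, assuming the built-in iris data and a temporary output path (both chosen only for illustration): open a graphics device, draw, then close the device.

```r
# Open a PDF device, draw the plot, and close the device to write the file
pdf_path <- file.path(tempdir(), "iris_scatter.pdf")
pdf(pdf_path, width = 7, height = 5)   # size in inches
plot(Sepal.Width ~ Sepal.Length, data = iris, pch = 16)
dev.off()

file.exists(pdf_path)
```

The png() device follows the same open, draw, dev.off() pattern, with width and height given in pixels by default.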
Getting Started
R programming style
• R is case sensitive: a and A are two different objects.
• The assignment symbol is <-. Alternatively, the classical = symbol can be used.
• The symbol # starts a comment that runs to the end of the line:
# This is a comment
# The following two statements are equivalent; both assign the value 1 to object a:
a <- 1
a = 1
Data Structure
1. Vector
2. Matrix
3. Array
4. Data Frame
5. List
http://venus.ifca.unican.es/Rintro/dataStruct.html
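Of the five structures listed, the matrix is the two-dimensional case. A short sketch using base R's matrix() (the object name m is illustrative):

```r
# A matrix is filled column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)     # number of rows and columns
m[2, 3]    # element in row 2, column 3
t(m)       # transpose: 3 rows, 2 columns
```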
Data Structure - Variable
Like most other languages, R lets you assign values to variables and refer
to them by name:
x <- 1
# x gets 1
y <- 2
# c(...): a generic function which combines values into a vector
z <- c(x,y)
# evaluate z to see what's stored as z
z
## [1] 1 2
Notice that the substitution is done at the time that the value is assigned
to z, not the time that z is evaluated:
y <- 5
z
## [1] 1 2
Data Structure - Vector
Fetch element(s) by location in a vector:
a <- c(1,2,3,4,5,6,7,8)
a
## [1] 1 2 3 4 5 6 7 8
# fetch the 5th item in vector a:
a[5]
## [1] 5
# fetch item 1 through 6:
a[1:6]
## [1] 1 2 3 4 5 6
# fetch item 1, 3, 7:
a[c(1,3,7)]
## [1] 1 3 7
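Besides positive positions, vectors can also be indexed with negative positions (to drop elements) and with logical conditions. A brief sketch on the same vector:

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)

# negative index: everything except the 1st item
a[-1]

# logical index: keep only the items greater than 5
a[a > 5]

# logical index: keep only the even items
a[a %% 2 == 0]
```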
Data Structure - Array
• In R, you can construct more complicated data structures than just vectors.
• An array object is just a vector that’s associated with a dimension attribute.
# Define an array
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim=c(2, 4))
a
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
# fetch one cell in array a:
a[2,3]
## [1] 6
# fetch 1st row only
a[1,]
## [1] 1 3 5 7
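The statement that an array is a vector with a dimension attribute can be checked directly. A minimal sketch (the names v, b, and arr3 are illustrative): setting dim() on a plain vector gives the same layout as array(), and a third dimension works the same way.

```r
# Start from a plain vector and attach a dimension attribute
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
dim(v) <- c(2, 4)

# The same object built with array()
b <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 4))
all.equal(v, b)

# Arrays can have more than two dimensions
arr3 <- array(1:24, dim = c(2, 3, 4))
arr3[2, 3, 4]   # last cell of the array
```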
Data Structure - Data frame
• A data frame is a list that contains multiple named vectors of the same length.
• It is like a spreadsheet or a database table, and is particularly good for representing experimental data.
# data.frame() creates data frames
team <-c("A","B","C","D","E")
first <- c(92, 89, 94, 72, 59)
second <- c(70, 73, 77, 90, 102)
mydf <- data.frame(team, first, second)
mydf
## team first second
## 1 A 92 70
## 2 B 89 73
## 3 C 94 77
## 4 D 72 90
## 5 E 59 102
# refer to the components of a data frame by name:
mydf$team
## [1] A B C D E
## Levels: A B C D E
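The "Levels:" line in the output above shows that team was stored as a factor, which was the default for strings before R 4.0.0. From R 4.0.0 onward, data.frame() keeps strings as character unless told otherwise. A small sketch making the choice explicit (df_chr and df_fct are illustrative names):

```r
team  <- c("A", "B", "C", "D", "E")
first <- c(92, 89, 94, 72, 59)

# Keep strings as character vectors
df_chr <- data.frame(team, first, stringsAsFactors = FALSE)
class(df_chr$team)

# Convert strings to factors, as in the output shown on this slide
df_fct <- data.frame(team, first, stringsAsFactors = TRUE)
class(df_fct$team)
levels(df_fct$team)
```

Setting stringsAsFactors explicitly makes a script behave the same across R versions.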
Data Structure - List
• R has a built-in data type for mixing objects of different types, called a list.
# list() constructs R lists.
# Example: a list containing two character vectors and a data frame
e <- list(thing=c("hat","shoes"), size=c("8.25","5"), myData=mydf)
e
## $thing
## [1] "hat" "shoes"
##
## $size
## [1] "8.25" "5"
##
## $myData
## team first second
## 1 A 92 70
## 2 B 89 73
## 3 C 94 77
## 4 D 72 90
## 5 E 59 102
Data Structure - List Cont
# fetch the 1st item in the list:
e$thing
## [1] "hat" "shoes"
e[1]
## $thing
## [1] "hat" "shoes"
# fetch the 1st row in the data frame
# which is the third component in the list:
e$myData[1,]
## team first second
## 1 A 92 70
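The two outputs above differ because single brackets return a sub-list, while $ and double brackets return the component itself. A short sketch with a smaller list (the name items is illustrative):

```r
items <- list(thing = c("hat", "shoes"), size = c("8.25", "5"))

# Single brackets keep the list wrapper
class(items[1])

# Double brackets (or $) extract the component itself
class(items[[1]])

# Extract a component by name, then index into it
items[["thing"]][2]
```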
Data Structure - Get Info about structure
# Here are some sample variables for example:
n <- 1:4
let <- LETTERS[1:4]
let
## [1] "A" "B" "C" "D"
df <- data.frame(n, let)
df
## n let
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
# Get information about structure
str(df)
## 'data.frame': 4 obs. of 2 variables:
## $ n : int 1 2 3 4
## $ let: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
Data Structure - Get Info about structure
# Get the length of a vector
length(n)
## [1] 4
# Number of rows
nrow(df)
## [1] 4
# Number of columns
ncol(df)
## [1] 2
# Get num of rows and columns
dim(df)
## [1] 4 2
Data Exploration
“Happy families are all alike; every unhappy family is unhappy in its own way.” Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham
Source: Hadley Wickham, http://r4ds.had.co.nz/tidy-data.html
Working with NA and NaN
R has several special values:
• NA: Not Available (i.e., a missing value)
• NaN: Not a Number
• Inf: Infinity
• -Inf: Minus Infinity
# For instance:
0/0
## [1] NaN
1/0
## [1] Inf
# Here's how to test whether a variable has one of these values:
y <- NA
# Is y NA?
is.na(y)
## [1] TRUE
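The other special values have their own test functions. A quick sketch applying them to one vector (the name vals is illustrative); note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN:

```r
vals <- c(1, NA, NaN, Inf, -Inf)

is.na(vals)        # TRUE for NA and NaN
is.nan(vals)       # TRUE only for NaN
is.infinite(vals)  # TRUE for Inf and -Inf
is.finite(vals)    # TRUE only for ordinary numbers
```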
Working with NA and NaN
Ignoring "bad" values in vector summary functions:
• Functions such as mean() or sum() return NA or NaN when the vector they are applied to contains NA or NaN.
• Many of these functions accept the argument na.rm, which tells them to ignore these values:
df1 <- c(1, 2, 3, NA, 5)
mean(df1)
## [1] NA
mean(df1, na.rm=TRUE)
## [1] 2.75
df2 <- c(1, 2, 3, NaN, 5)
sum(df2)
## [1] NaN
sum(df2, na.rm=TRUE)
## [1] 11
Example: Import Data
library(readr)
HW <- read_csv("dataSets/Student_List_HW.csv")
HW<-as.data.frame(HW)
summary(HW)
## Last_Name First_Name Status
## Length:20 Length:20 Length:20
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Home Homework_1 Homework_2 Homework_3
## Length:20 Min. :58.00 Min. :77.00 Min. : 80.00
## Class :character 1st Qu.:70.50 1st Qu.:80.00 1st Qu.: 85.50
## Mode :character Median :74.50 Median :88.00 Median : 90.50
## Mean :77.39 Mean :87.35 Mean : 90.90
## 3rd Qu.:84.25 3rd Qu.:93.00 3rd Qu.: 98.25
## Max. :99.00 Max. :99.00 Max. :100.00
## NA's :2
Example: Replace Missing Values
HW$Homework_1[is.na(HW$Homework_1)]<-0
HW$Home[which(HW$Last_Name=="Garcia")]<-"NJ"
HW$Home[is.na(HW$Home)]<-"Unknown"
HW<-HW[complete.cases(HW),]
summary(HW)
## Last_Name First_Name Status
## Length:18 Length:18 Length:18
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Home Homework_1 Homework_2 Homework_3
## Length:18 Min. : 0.00 Min. :77.00 Min. : 80.00
## Class :character 1st Qu.:66.75 1st Qu.:80.00 1st Qu.: 86.25
## Mode :character Median :74.50 Median :86.00 Median : 90.50
## Mean :70.28 Mean :86.39 Mean : 91.33
## 3rd Qu.:84.25 3rd Qu.:91.75 3rd Qu.: 98.75
## Max. :99.00 Max. :98.00 Max. :100.00
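The same three steps (fill numeric gaps, fill text gaps, drop rows that are still incomplete) can be tried without the course file. The sketch below uses a small made-up data frame named grades; its columns and values are assumptions for illustration only:

```r
# Toy data with missing values (hypothetical)
grades <- data.frame(
  name = c("Ann", "Bob", "Cal", "Dee"),
  home = c("NJ", NA, "PA", "NY"),
  hw1  = c(90, NA, 75, 80),
  hw2  = c(88, 92, NA, 70),
  stringsAsFactors = FALSE
)

# Replace missing hw1 scores with 0 and missing home values with "Unknown"
grades$hw1[is.na(grades$hw1)] <- 0
grades$home[is.na(grades$home)] <- "Unknown"

# Drop any row that still contains a missing value (here: Cal, missing hw2)
clean <- grades[complete.cases(grades), ]
clean
```

Whether a missing score should count as 0 or be dropped is a modelling decision; the code only shows the mechanics.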
Subset Observations (Rows)
Source: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Subset Observations (Rows) Cont.
#load dplyr
library(dplyr)
Subset_HW_1 <- filter(HW,Status == "Master")
head(Subset_HW_1)
## Last_Name First_Name Status Home Homework_1 Homework_2 Homework_3
## 1 Brown Susan Master NJ 74 88 98
## 2 Wilson Karen Master NJ 0 93 84
## 3 Moore Nancy Master PA 74 91 89
## 4 Taylor Betty Master GA 93 92 88
## 5 Anderson Anthony Master CA 96 98 100
## 6 Thomas Donald Master NJ 82 77 96
Subset Variables (Columns)
There are many ways to choose columns:
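A few of these options, sketched with dplyr's select() on a one-row toy data frame (the data frame toy is hypothetical, and the dplyr package is assumed to be installed):

```r
library(dplyr)

toy <- data.frame(Last_Name = "Smith", First_Name = "Pat", Status = "PhD",
                  Homework_1 = 82, Homework_2 = 97)

# Columns whose names start with a prefix
names(select(toy, starts_with("Homework")))

# Columns whose names end with a suffix
names(select(toy, ends_with("Name")))

# All columns except one
names(select(toy, -Status))

# A range of adjacent columns
names(select(toy, Last_Name:Status))
```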
Subset Variables (Columns) Cont.
Subset_HW_2 <- select(HW,contains("Name"),contains("Homework"))
head(Subset_HW_2)
## Last_Name First_Name Homework_1 Homework_2 Homework_3
## 1 Smith Patricia 82 97 82
## 2 Johnson Jennifer 0 77 99
## 3 Williams Robert 99 80 80
## 4 Jones Michael 75 82 86
## 5 Brown Susan 74 88 98
## 7 Miller Richard 85 78 82
Subset Observations (Rows) and Variables (Columns)
Subset_HW_3 <- subset(HW,Status == "Master" ,
select=c("Last_Name","First_Name",
"Homework_1","Homework_2","Homework_3"))
head(Subset_HW_3)
## Last_Name First_Name Homework_1 Homework_2 Homework_3
## 5 Brown Susan 74 88 98
## 8 Wilson Karen 0 93 84
## 9 Moore Nancy 74 91 89
## 10 Taylor Betty 93 92 88
## 11 Anderson Anthony 96 98 100
## 12 Thomas Donald 82 77 96
Pipe Operator
Piping makes code more readable and lets us chain several actions, such as sorting, filtering, or creating a variable, in a single statement.
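To see what the pipe buys, compare a nested call with its piped equivalent. This is a sketch on a made-up data frame (toy, with hypothetical columns), assuming dplyr is installed; x %>% f(y) is simply f(x, y):

```r
library(dplyr)

toy <- data.frame(name   = c("Ann", "Bob", "Cal"),
                  status = c("Master", "PhD", "Master"),
                  score  = c(74, 90, 96))

# Nested calls are read from the inside out
nested <- head(arrange(filter(toy, status == "Master"), desc(score)), 1)

# The piped version reads top to bottom, in the order the steps happen
piped <- toy %>%
  filter(status == "Master") %>%
  arrange(desc(score)) %>%
  head(1)

all.equal(nested, piped)
```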
Pipe Operator Cont.
HW %>%
filter(Status == "Master") %>%
select(contains("Name"),contains("Homework"))%>%
arrange(desc(Homework_1))%>%
head()
## Last_Name First_Name Homework_1 Homework_2 Homework_3
## 1 Anderson Anthony 96 98 100
## 2 Taylor Betty 93 92 88
## 3 Garcia Linda 93 91 100
## 4 Thomas Donald 82 77 96
## 5 Brown Susan 74 88 98
## 6 Moore Nancy 74 91 89
Create New Columns and Re-order
The mutate() function will add new columns to the data frame.
Arrange or re-order rows using arrange().
HW_update<-HW %>%
filter(Status != "Unknown") %>%
mutate(Homework_Average = 0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)%>%
arrange(desc(Homework_Average))
head(HW_update)
## Last_Name First_Name Status Home Homework_1 Homework_2
## 1 Anderson Anthony Master CA 96 98
## 2 Garcia Linda Master NJ 93 91
## 3 Wang Thomas PhD CHINA 72 98
## 4 Martin Morgan Undergraduate NJ 72 88
## 5 Brown Susan Master NJ 74 88
## 6 Taylor Betty Master GA 93 92
## Homework_3 Homework_Average
## 1 100 98.6
## 2 100 95.9
## 3 95 91.3
## 4 99 90.3
## 5 98 90.2
## 6 88 90.2
Split-Apply-Combine
Idea: split up a big problem into manageable pieces, apply a function to
each piece and then combine all the pieces together.
Original data    Split (by X)    Apply (average)    Combine
X  Y             X  Y            X  Y               X  Y
A  2             A  2            A  3               A  3
A  4             A  4                               B  2.5
B  0             B  0            B  2.5             C  7.5
B  5             B  5
C  5             C  5            C  7.5
C  10            C  10
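The three stages can be written out one at a time in base R. The sketch below uses the six (X, Y) pairs from the diagram; the object names d, pieces, means, and result are illustrative:

```r
d <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
                Y = c(2, 4, 0, 5, 5, 10))

# Split: one vector of Y values per level of X
pieces <- split(d$Y, d$X)

# Apply: compute the average of each piece
means <- sapply(pieces, mean)

# Combine: gather the per-group results into one data frame
result <- data.frame(X = names(means), Y = unname(means))
result
```

Base R's aggregate(Y ~ X, data = d, FUN = mean) performs the same three stages in a single call.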
Group Data
Group operations implement the “split-apply-combine” idea:
Group Data
Group_Summarise_HW<- HW %>%
filter(Status != "Unknown") %>%
mutate(Homework_Average = 0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)%>%
group_by(Status) %>%
summarise(Homework_Average=mean(Homework_Average),
Number_of_Student=length(Status))%>%
arrange(desc(Homework_Average))
head(Group_Summarise_HW)
## # A tibble: 3 x 3
## Status Homework_Average Number_of_Student
## <chr> <dbl> <int>
## 1 Master 87.4 8
## 2 PhD 86.4 2
## 3 Undergraduate 83.7 8
Reshape Data
Let's change the layout of a data set. The tools from the tidyr library are:
Source: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Reshape Data Cont.
• gather() makes "wide" data longer
• unite() combines two variables into one variable
#load tidyr
library(tidyr)
tidyr_HW<- HW %>% unite(Name, First_Name, Last_Name, sep = " ")%>%
select(-c(Status,Home)) %>%
gather(Homework, Score, Homework_1:Homework_3)
head(tidyr_HW)
## Name Homework Score
## 1 Patricia Smith Homework_1 82
## 2 Jennifer Johnson Homework_1 0
## 3 Robert Williams Homework_1 99
## 4 Michael Jones Homework_1 75
## 5 Susan Brown Homework_1 74
## 6 Richard Miller Homework_1 85
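In tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer(). A sketch of the equivalent call on a small made-up data frame (wide is hypothetical; a recent tidyr is assumed to be installed):

```r
library(tidyr)

wide <- data.frame(Name = c("Ann Lee", "Bob Ray"),
                   Homework_1 = c(82, 0),
                   Homework_2 = c(97, 77))

# names_to / values_to play the role of gather()'s key and value arguments
long <- pivot_longer(wide,
                     cols = Homework_1:Homework_2,
                     names_to = "Homework",
                     values_to = "Score")
long
```

Unlike gather(), pivot_longer() keeps the rows of each original observation together, so the row order of the result differs.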
Merge Data
Exam<- read_csv("dataSets/Student_List_Exam.csv")
Exam<-as.data.frame(Exam)
head(Exam,3)
## Last_Name First_Name Exam Project
## 1 Smith Patricia 77 65
## 2 Johnson Jennifer 100 96
## 3 Williams Robert 92 53
HW_update<-mutate(HW,Homework_Average =
0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)
Merged_df<-inner_join(HW_update, Exam,by=c("Last_Name","First_Name"))
head(Merged_df,3)
## Last_Name First_Name Status Home Homework_1 Homework_2 Homework_3
## 1 Smith Patricia Undergraduate MD 82 97 82
## 2 Johnson Jennifer Undergraduate NY 0 77 99
## 3 Williams Robert Undergraduate NY 99 80 80
## Homework_Average Exam Project
## 1 86.5 77 65
## 2 72.6 100 96
## 3 83.8 92 53
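inner_join() keeps only the rows whose keys appear in both tables. To make that concrete, the sketch below contrasts it with left_join() on two tiny made-up tables (hw and exam are hypothetical; dplyr is assumed to be installed):

```r
library(dplyr)

hw   <- data.frame(Last_Name = c("Smith", "Jones", "Brown"),
                   Homework_1 = c(82, 75, 74))
exam <- data.frame(Last_Name = c("Smith", "Brown", "Moore"),
                   Exam = c(77, 88, 91))

# inner_join: only students present in both tables (Smith, Brown)
inner <- inner_join(hw, exam, by = "Last_Name")
inner

# left_join: every row of hw; Exam is NA where exam has no match (Jones)
left <- left_join(hw, exam, by = "Last_Name")
left
```

Checking the row count before and after a join is a quick way to notice students silently dropped by an inner join.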
ggplot2
• ggplot2 is an R package designed for creating high-quality plots.
• ggplot2 is based on the layered grammar of graphics, which means that plots can be constructed layer by layer.
# you need to install the package just once
install.packages('ggplot2')
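"Layer by layer" can be taken literally: a plot object can be built up with repeated + calls. A minimal sketch, assuming ggplot2 is installed and using the economics data set that ships with it (the axis labels are chosen here for illustration):

```r
library(ggplot2)

# Start with the data and the aesthetic mapping only (no layers yet)
p <- ggplot(data = economics, aes(x = date, y = unemploy))

# Add one layer at a time
p <- p + geom_line()              # layer 1: a line
p <- p + geom_point(size = 0.5)   # layer 2: points on top of the line

# Labels are not a layer, but are added with the same + syntax
p <- p + labs(x = "Date", y = "Unemployed (thousands)")

length(p$layers)   # number of layers added so far
p                  # printing the object draws the plot
```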
Composition of plots in ggplot2
Plots have two main components: 1) the data to use and 2) the type of plot.
ggplot(data=economics) + geom_point(aes(x=date, y=unemploy))
• ggplot(): the basic function for plotting
• data=economics: specifies the dataset (data to use)
• geom_point(): we want points (type of plot)
• aes(): aesthetics; x=date specifies what goes on the X axis, y=unemploy specifies what goes on the Y axis
Our first official graph
library(ggplot2)
ggplot(data=iris)+
geom_point(aes(x=Sepal.Width,y=Sepal.Length,colour=Species))
[Figure: scatter plot of Sepal.Length (y axis) against Sepal.Width (x axis), with points coloured by Species: setosa, versicolor, virginica.]
Resources
1. Rob Kabacoff, “R in Action”:
https://www.amazon.com/Action-Data-Analysis-Graphics/dp/1617291382/ref=pd_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=EEBN1DRHWQ6J09Z6TTBY
2. Michael J Crawley, “The R Book”:
http://users.humboldt.edu/ygkim/CrawleyMJ_TheRBook.pdf
3. Joseph Adler, “R in a Nutshell”:
http://www.amazon.com/R-Nutshell-Joseph-Adler/dp/144931208X
4. Quick-R tutorial: http://www.statmethods.net/input/datatypes.html
5. Cookbook for R, Data input and output:
http://www.cookbook-r.com/Data_input_and_output/Writing_data_to_a_file/
What have we learned?
1. Data structures such as vector, array, list, and data frame
2. Basic operations such as installing packages and importing/exporting datasets
3. Common data manipulation operations such as filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data, and performing the "split-apply-combine" task
4. Drawing a graph