DSA2101
Essential Data Analytics Tools: Data Visualization
Yuting Huang
AY24/25
Week 3: Importing Data I
The teaching team
Instructor:
▶ Dr. Huang Yuting (yhuang@[Link])
▶ Office: S16 04-01
▶ Office hour: In-person and by appointment
Teaching assistants (TAs): In-person/online and by appointment
▶ Yeo Jaye Lin (e1249197@[Link])
▶ Quek Chui Qing (e1157262@[Link])
▶ Agrawal Naman (naman.a@[Link])
▶ Loo Wen Wen (e0970566@[Link])
▶ Zhang Mingyuan (e0970135@[Link])
Tutorials in Week 3
Tutorials begin this week.
Due to the CNY public holidays, the sessions will be rescheduled and held online.
▶ Your TA will be in touch with you to share the time and
meeting link.
▶ All sessions will be recorded and made available on Canvas by the end of
this week.
Importing data into R
Week 3:
1. CSV files
2. Flat files
3. Excel files
Week 4:
4. R data files
5. JSON files
6. Files from the web
7. APIs
Recap
An important prerequisite for loading data into R is being able
to point to the location where the data files are stored.
1. Where am I?
2. Where are my data?
Working directory
The first question addresses the notion of our current working
directory.
▶ Typically, it is the location of our current R script.
▶ The function getwd() returns the absolute path of our current
working directory.
getwd()
File path
The second question implies that data are not necessarily stored at
the location of our current working directory.
▶ Relative path: the address of a file relative to our current
working directory.
▶ Files in the current working directory can be accessed directly
by name.
▶ Use two dots .. to denote “one level up in the directory
hierarchy”.
Use relative paths in all the code you write.
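A minimal sketch of how a relative path resolves. The folders are created under tempdir() and all folder and file names are made up for illustration.

```r
# Throwaway folders under tempdir(); the names are made up for illustration.
parent <- file.path(tempdir(), "parent")
child  <- file.path(parent, "child")
dir.create(child, recursive = TRUE, showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(parent, "toy.csv"))

old_wd <- setwd(child)                 # make child/ the working directory
here   <- file.exists("toy.csv")       # FALSE: toy.csv is not inside child/
one_up <- file.exists("../toy.csv")    # TRUE: ".." points to parent/
setwd(old_wd)                          # restore the previous working directory
```

The same code would run on any machine, because nothing in it depends on where parent/ sits on the disk.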
File path (Important!)
We will strictly adhere to the following practice:
▶ Store all course materials in a folder named DSA2101.
▶ Within DSA2101, create a sub-folder named src to store all R
scripts and Rmd files.
▶ Within DSA2101, create another sub-folder called data to store
all data sets.
▶ The src and data folders should be positioned at the same
hierarchical level within DSA2101.
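The folder layout described above can be sketched in R as follows. This is only an illustration: it builds the folders under tempdir() so that nothing on your disk is touched, whereas in practice you would create DSA2101 wherever you keep your course work.

```r
# Create the course layout under tempdir() so nothing on disk is overwritten.
course <- file.path(tempdir(), "DSA2101")
dir.create(file.path(course, "src"),  recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(course, "data"), recursive = TRUE, showWarnings = FALSE)

# src and data sit side by side, one level below DSA2101.
subfolders <- sort(list.dirs(course, full.names = FALSE, recursive = FALSE))

# From a script in src/, every data set is reached through "../data/".
old_wd <- setwd(file.path(course, "src"))
data_visible <- dir.exists("../data")
setwd(old_wd)   # restore the previous working directory
```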
Memory requirements for R objects
Remember that R stores all its objects using physical memory.
▶ It is important to be aware of how much memory is being used in
your workspace.
▶ Especially when we are reading in or creating a new (large) data
set in R.
Other programs running on our computer take up RAM, and so do the
other R objects that already exist in the workspace.
Memory requirements for R objects
If you do not have enough RAM, your computer (or at least
your R session) will freeze up.
▶ This is usually an unpleasant experience that requires you to kill
the R session (in the best case), or
▶ ... reboot your computer.
So make sure you understand the memory requirements before
reading in or creating large data sets!
Read more about this on Posit.
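A rough way to estimate memory needs before creating or reading a large object, sketched under the assumption that all columns are numeric (8 bytes per value). The sizes below are invented for illustration.

```r
# A numeric (double) value takes 8 bytes, so a rough estimate is rows * cols * 8.
n_rows <- 1e5
n_cols <- 10
estimated_mb <- n_rows * n_cols * 8 / 1024^2   # about 7.6 MB

# Check the estimate against an actual object of the same shape.
x <- matrix(0, nrow = n_rows, ncol = n_cols)
actual_mb <- as.numeric(object.size(x)) / 1024^2

print(object.size(x), units = "MB")
rm(x)   # drop the object once we are done with it
```

The actual size is slightly larger than the estimate because every R object carries a small header. Character and factor columns do not follow the 8-bytes rule.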
Comma separated values
We first consider the simplest file format – comma separated values
(CSV).
Alice, 98, 92, 94
Brown, 85, 89, 91
Carly, 81, 96, 97
These files are in fact just text files, with
▶ An optional header, listing the column names.
▶ One observation per row, with the values in each row separated by commas.
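To see that a CSV file is plain text, here is a small sketch that writes the three records above to a temporary file and reads them back. The column names are invented for illustration, since the example has no header row.

```r
# Write the three records from the example to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Alice, 98, 92, 94",
             "Brown, 85, 89, 91",
             "Carly, 81, 96, 97"), csv_path)

readLines(csv_path)   # the raw text, exactly as a text editor would show it

# There is no header row, so we say so and supply the names ourselves.
scores <- read.csv(csv_path, header = FALSE,
                   col.names = c("name", "test1", "test2", "test3"),
                   strip.white = TRUE)   # drop the spaces after the commas
```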
What does a CSV file look like?
A .csv file, opened in a text editor.
▶ This is the raw form of the data.
What does a CSV file look like?
Here is the same file opened in Microsoft Excel.
▶ Excel assumes that it is a spreadsheet and puts each element in its
own cell.
Read a CSV file into R
The base R command to read a CSV file is read.csv().
The main arguments to this function are:
▶ file: The file name.
▶ header: Absence/presence of a header row. The default is TRUE.
▶ col.names: The names to identify columns in the table.
▶ stringsAsFactors: Whether to convert character vectors to
factors.
▶ na.strings: Specify strings to be interpreted as NA values.
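A short sketch of these arguments in action. The file contents and the missing-value codes ("-99" and "missing") are invented for illustration.

```r
# A small file with a header, a text column, and two ways of marking "missing".
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,a,10",
             "2,b,-99",
             "3,a,missing"), csv_path)

dat <- read.csv(csv_path,
                header = TRUE,                           # first line holds the column names
                stringsAsFactors = TRUE,                 # read `group` as a factor
                na.strings = c("NA", "-99", "missing"))  # all of these become NA
str(dat)
```

Without the na.strings argument, the word "missing" would force the whole score column to be read as character.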
Example: Education, Height, and Income
The file [Link] contains information on 1192 individuals.
▶ Contains 6 columns. There’s also a column header.
▶ Hence, we read in the data in the following way:
heights <- read.csv("../data/[Link]", header = TRUE)
dim(heights)
## [1] 1192 6
▶ The function dim() (stands for dimensions) tells us that the
data frame has 1192 rows and 6 columns.
Data checks
1. What type has each column been read in as?
str(heights)
## 'data.frame': 1192 obs. of 6 variables:
## $ earn : num 50000 60000 30000 50000 51000 9000 29000 32000 2000 2
## $ height: num 74.4 65.5 63.6 63.1 63.4 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 45 58 29 91 39 26 49 46 21 26 ...
## $ race : chr "white" "white" "white" "other" ...
▶ The function str() (stands for structure) reveals information
about the columns, giving the names of the columns and a peek
into the contents of each.
Data checks
2. race is a categorical variable.
What are the different races that have been read in?
heights$race <- factor(heights$race)
levels(heights$race)
## [1] "black" "hispanic" "other" "white"
▶ A contingency table of the counts of each factor level:
table(heights$race)
##
## black hispanic other white
## 112 66 25 989
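The same two steps on a toy vector, as a sketch (the values are invented and are not taken from the course data set):

```r
# A character vector of categories, as read.csv() would return it in R 4.x.
race <- c("white", "black", "white", "other", "hispanic", "white")

race_f <- factor(race)     # convert to a categorical variable
levels(race_f)             # distinct categories, sorted alphabetically
counts <- table(race_f)    # number of observations in each category
counts
```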
Data checks
3. Are there any missing values in the data?
sum(is.na(heights))
## [1] 0
▶ Use is.na() to check missing entries in the entire data set.
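When the total is not zero, it helps to know where the missing values are. A small sketch on an invented data frame:

```r
# Toy data frame with a few missing entries (values invented for illustration).
toy <- data.frame(earn   = c(50000, NA, 30000, 12000),
                  height = c(74.4, 65.5, NA, 63.1),
                  sex    = c("male", "female", NA, "female"))

total_missing  <- sum(is.na(toy))            # missing entries in the whole data frame
missing_by_col <- colSums(is.na(toy))        # missing entries per column
complete_rows  <- sum(complete.cases(toy))   # rows without any missing value
```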
Summary statistics
▶ We can compute summary statistics for earn:
summary(heights$earn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200 10000 20000 23155 30000 200000
▶ Summary statistics by group with aggregate():
aggregate(earn ~ sex, data = heights, FUN = median)
## sex earn
## 1 female 15000
## 2 male 25000
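A self-contained sketch of the grouped summary on invented numbers, with tapply() shown as a base R alternative that returns a named vector instead of a data frame:

```r
# Toy earnings data (values invented for illustration).
toy <- data.frame(
  earn = c(10000, 20000, 30000, 15000, 25000, 50000),
  sex  = c("female", "female", "female", "male", "male", "male")
)

# Formula interface: summarise earn within each level of sex.
med_by_sex <- aggregate(earn ~ sex, data = toy, FUN = median)

# Same computation, returned as a named vector.
med_vec <- tapply(toy$earn, toy$sex, median)
```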
Histogram
Let us use a histogram to visualize the distribution of income.
▶ A histogram, hist(), divides the range of numeric values into
bins, then counts the number of observations that fall into each
bin.
▶ By default, the height of each bar represents frequencies.
▶ freq = FALSE alters a histogram such that the height represents
the probability densities (that is, the histogram has a total area
of one).
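A quick check of the "total area of one" statement, using simulated right-skewed data rather than the course data set:

```r
set.seed(1)
x <- rexp(500, rate = 1 / 20000)   # simulated right-skewed "earnings"

h <- hist(x, plot = FALSE)         # compute the bins without drawing

counts_total <- sum(h$counts)                    # every observation falls in one bin
area         <- sum(h$density * diff(h$breaks))  # bar height times bar width, summed
```

Here counts_total equals the sample size and area equals one, which is what distinguishes the frequency scale from the density scale.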
hist(heights$earn, freq = FALSE, col = "maroon",
main = "Histogram of Earnings", xlab = "Earnings")
[Figure: histogram titled “Histogram of Earnings”; x-axis: Earnings (0 to 200,000); y-axis: Density]
▶ The distribution of income is right-skewed, as expected.
Histogram (revised code)
Our presentation of the histogram can be improved:
1. The bins correspond to intervals of width 20,000. We would like
bins of width 10,000 instead.
2. Transform the x-axis to display earnings in thousands of dollars
for better readability.
hist(heights$earn/1000, freq = FALSE, col = "maroon",
breaks = seq(0, 200, by = 10),
main = "Histogram of Earnings",
xlab = "Earnings (in thousands)")
▶ heights$earn/1000 divides earnings by a thousand. The earnings
values now range from 0.2 to 200.
▶ breaks = seq(0, 200, by = 10) places the bin boundaries at
0, 10, ..., 200, which gives bins of width 10 covering 0 to 200.
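A sketch of how the breaks argument determines the bins, on a few invented values (in thousands). It also shows that the breaks must cover the whole range of the data.

```r
earn_k <- c(0.2, 5, 12, 18, 25, 31, 47, 110, 200)   # invented, in thousands

h <- hist(earn_k, breaks = seq(0, 200, by = 10), plot = FALSE)
n_bins <- length(h$counts)   # 21 boundaries give 20 bins
h$counts[1]                  # observations between 0 and 10

# Breaks that stop at 100 leave 110 and 200 outside every bin: hist() fails.
too_short <- tryCatch(
  hist(earn_k, breaks = seq(0, 100, by = 10), plot = FALSE),
  error = function(e) "error"
)
```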
Histogram (revised code)
[Figure: histogram titled “Histogram of Earnings”; x-axis: Earnings (in thousands), 0 to 200; y-axis: Density]
The income distribution
Who are the high-earning individuals, that is, those who earn more
than 100,000 a year?
# install.packages("tidyverse")
library(tidyverse)
filter(heights, earn > 100000)
## earn height sex ed age race
## 1 125000 74.34062 male 18 45 white
## 2 170000 71.01003 male 18 45 white
## 3 175000 70.58955 male 16 48 white
## 4 148000 66.74020 male 18 38 white
## 5 110000 65.96504 male 18 37 white
## 6 105000 74.58005 male 12 49 white
## 7 123000 61.42908 female 14 58 white
## 8 200000 69.66276 male 18 34 white
## 9 110000 66.31203 female 18 48 other
The income distribution
library(tidyverse)
filter(heights, earn > 100000)
The code uses the dplyr syntax.
▶ It is a great tool for data cleaning and manipulation.
▶ We shall learn about it soon.
▶ For now, we only need to understand that it filters out irrelevant
rows from the heights data frame, keeping only those who earned
more than 100,000 per year.
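For comparison, a sketch of two base R ways to keep the same kind of rows, on an invented data frame. These are not the dplyr code above, only base R counterparts that need no extra package.

```r
# Toy data (values invented for illustration).
toy <- data.frame(earn = c(50000, 125000, 9000, 170000),
                  sex  = c("male", "male", "female", "male"))

# Logical indexing: keep the rows where the condition is TRUE, keep all columns.
high_idx <- toy[toy$earn > 100000, ]

# subset() evaluates the condition inside the data frame, much like filter().
high_sub <- subset(toy, earn > 100000)
```

One difference to keep in mind: if earn contained NA, logical indexing would return extra rows filled with NA, whereas subset() drops the rows where the condition is NA.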
Recap
Remember that you should inspect your data before and after you
read them in.
▶ Think of as many ways as possible in which the import could have
gone wrong, and check each of them.
As we covered here, you should at least consider the following:
▶ Correct number of rows and columns.
▶ Column variables read in with the correct class type.
▶ Missing values.
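These checks can be bundled into a small helper. The function below is a hypothetical sketch (check_import is not part of base R or any package), and the expected values passed to it are whatever you know about your own file.

```r
# Hypothetical helper: compare an imported data frame against what we expect.
check_import <- function(df, n_rows, n_cols, classes) {
  stopifnot(is.data.frame(df))
  found <- unname(vapply(df, function(col) class(col)[1], character(1)))
  c(rows_ok    = nrow(df) == n_rows,
    cols_ok    = ncol(df) == n_cols,
    classes_ok = identical(found, classes),
    no_missing = !any(is.na(df)))
}

# Toy data frame standing in for a freshly imported file.
toy <- data.frame(earn = c(50000, 60000),
                  sex  = c("male", "female"),
                  ed   = c(16L, 12L),
                  stringsAsFactors = FALSE)

result <- check_import(toy, n_rows = 2, n_cols = 3,
                       classes = c("numeric", "character", "integer"))
result
```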
Flat file
The readr package is developed to deal with reading in large flat
files quickly.
▶ Faster than base R analogues.
▶ The function for CSV files is read_csv().
# install.packages("readr")
library(readr)
heights <- read_csv("../data/[Link]")
▶ We can also use this function to read data directly from a URL
(more on this later).
Other file types
readr provides other functions to read in data:
▶ read_csv2() reads semicolon-separated files.
▶ read_tsv() reads tab-delimited files.
▶ read_delim() reads in files with any delimiter, attempting to
automatically guess the delimiter if you do not specify it.
▶ ...
Useful documentation and cheatsheet on data import.
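A sketch of three of these functions on small temporary files. The file contents are invented, and the show_col_types argument assumes readr version 2.0 or later.

```r
library(readr)

# Three small files with different delimiters (contents invented for illustration).
tsv_path  <- tempfile(fileext = ".tsv")
semi_path <- tempfile(fileext = ".csv")
pipe_path <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "Alice\t98", "Brown\t85"), tsv_path)
writeLines(c("name;score", "Alice;98,5", "Brown;85,0"), semi_path)
writeLines(c("name|score", "Alice|98", "Brown|85"), pipe_path)

tab_data  <- read_tsv(tsv_path, show_col_types = FALSE)     # tab-delimited
semi_data <- read_csv2(semi_path, show_col_types = FALSE)   # ";" delimiter, "," decimal mark
pipe_data <- read_delim(pipe_path, delim = "|", show_col_types = FALSE)
```

Note that read_csv2() also treats the comma as the decimal mark, so "98,5" is read as the number 98.5.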
Excel spreadsheets
To read data from xls and xlsx spreadsheets, we need the readxl
package.
# install.packages("readxl")
library(readxl)
▶ The read_excel() function automatically detects the rectangular
region that contains non-empty cells in the Excel spreadsheet.
▶ Nonetheless, ensure that you open up your file in Excel first, to
see what it contains and how you can provide further contextual
information for the function to use.
Excel example
read_excel("../data/read_excel_01.xlsx")
## # A tibble: 7 x 5
## `Table 1` ...2 ...3 ...4 ...5
## <lgl> <lgl> <chr> <dbl> <chr>
## 1 NA NA <NA> NA <NA>
## 2 NA NA <NA> NA <NA>
## 3 NA NA <NA> NA <NA>
## 4 NA NA <NA> NA <NA>
## 5 NA NA a 1 m
## 6 NA NA b 2 m
## 7 NA NA c 3 m
▶ In this example, read_excel() needs a little help as the data
seems to be “floating” in the center of the worksheet.
Excel example
read_excel("../data/read_excel_01.xlsx", skip = 5)
## # A tibble: 2 x 3
## a `1` m
## <chr> <dbl> <chr>
## 1 b 2 m
## 2 c 3 m
▶ The skip argument tells R to skip a certain number of rows.
▶ By default, the function reads the first row as the header. We
can disable it with col_names = FALSE.
▶ Notice that read_excel() uses a col_names argument, instead
of header.
Excel example
Another way is to specify the data range precisely.
▶ We can also supply a set of column names in col_names.
read_excel("../data/read_excel_01.xlsx",
range = "C6:E8", col_names = c("var1", "var2", "var3"))
## # A tibble: 3 x 3
## var1 var2 var3
## <chr> <dbl> <chr>
## 1 a 1 m
## 2 b 2 m
## 3 c 3 m
▶ In case you were wondering, a tibble is an improved version of a
data frame. We shall learn more about it soon.
Example: Workplace injuries
The Excel file Workplace_injuries.xlsx contains data on selected
workplace injuries from 2019 to 2022.
▶ Originally from the Ministry of Manpower (MOM).
injuries <- read_excel("../data/Workplace_injuries.xlsx")
injuries
## # A tibble: 6 x 5
## Type `2019` `2020` `2
## <chr> <dbl> <dbl> <
## 1 Crushing, fractures and dislocations 3107 2577
## 2 Cuts and Bruises 4500 3895
## 3 Sprains & Strains 1982 1791
## 4 Others 2418 1675
## 5 <NA> NA NA
## 6 Notes: Workplace injury numbers include injuries ~ NA NA
To read in the correct range of data, we should specify an appropriate
range.
injuries <- read_excel("../data/Workplace_injuries.xlsx",
range = "A1:E5")
injuries
## # A tibble: 4 x 5
## Type `2019` `2020` `2021` `2022`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Crushing, fractures and dislocations 3107 2577 2950 2759
## 2 Cuts and Bruises 4500 3895 4263 4333
## 3 Sprains & Strains 1982 1791 1829 1778
## 4 Others 2418 1675 2100 2022
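The same idea can be tried without the course file. The sketch below uses deaths.xlsx, an example workbook that ships with the readxl package and also has notes around the actual table. It is a stand-in for illustration, not the Workplace_injuries.xlsx file.

```r
library(readxl)

# Path to an example workbook bundled with readxl.
path <- readxl_example("deaths.xlsx")

raw     <- read_excel(path)                     # the notes are read in as data
trimmed <- read_excel(path, range = "A5:F15")   # header row plus the table only

dim(raw)
dim(trimmed)
```

Comparing the two dimensions shows how many stray rows the range argument removes.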
Common errors
When we first start importing data into R, it’s common to see some
frustrating error messages.
▶ The most common error is:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'some_file.csv': No such file or directory
▶ This indicates that R cannot find the file you are trying to import.
▶ Check your file path! Perhaps also the spelling of the filename.
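One way to diagnose this error is to check the path before reading. The helper below is a hypothetical sketch (read_csv_safely is an invented name, not an existing function).

```r
# Hypothetical helper: read a CSV only if the path resolves, otherwise explain why.
read_csv_safely <- function(path) {
  if (!file.exists(path)) {
    message("Cannot find '", path, "' from working directory ", getwd())
    folder <- dirname(path)
    if (dir.exists(folder)) {
      # Listing the folder often reveals a typo in the file name.
      message("Files in '", folder, "': ",
              paste(list.files(folder), collapse = ", "))
    }
    return(NULL)
  }
  read.csv(path)
}

# A path that does not exist: the helper reports the problem and returns NULL.
missing_result <- read_csv_safely(file.path(tempdir(), "no_such_file.csv"))
```

If the folder itself is reported as missing, the working directory is usually not where you think it is, so check getwd() first.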
Summary
We learned how to import data from different formats and sources:
1. CSV files using read.csv().
2. Flat files using functions from the readr package.
3. Excel files with read_excel() from the readxl package.
We also saw a few more ways to check, clean, and visualize data.