3 Exploring and Visualizing Data
Political Science
Exploring and Visualizing Data
Elmer Poliquit
What is RStudio?
RStudio is an Integrated Development Environment (IDE) that
makes it easier to issue R commands interactively.
It has many convenient, easy-to-use tools for coding, plot
management, object browsing, and package management, among other
things.
To download RStudio: https://www.rstudio.com/
Quantitative Research Methods for Political Science - ESPoliquit
Introduction to RStudio
library(MASS)
Other Remarks
▶ More syntax will be added as we progress with the learning.
▶ R syntax is case-sensitive (e.g., the variables x and X are
different). Thus, creating objects with the same name but
different case is highly discouraged.
▶ A process can usually be coded and done in more than one
way in R.
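The two remarks on case sensitivity and on multiple ways of coding can be illustrated with a minimal sketch. The object names mean_builtin and mean_manual are chosen here for illustration only; the ages vector is the one used in these slides.

```r
# Names that differ only in case refer to different objects.
x <- 10
X <- 20
x == X   # FALSE: x and X hold different values

# The same result can often be reached in more than one way.
ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)
mean_builtin <- mean(ages)                 # built-in function
mean_manual  <- sum(ages) / length(ages)   # definition of the mean
all.equal(mean_builtin, mean_manual)       # TRUE
```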
Basic Data Management and Functions in RStudio
Other Remarks
▶ Assignment operator. The value of the R expression on the right
is assigned to the object on its left using the assignment operator
<- or =.
x<-c(1,2,3)
x
[1] 1 2 3
y=c(1,2,3)
y
[1] 1 2 3
Basic Data Management and Functions in RStudio
Vector
ages=c(12,15,14,17,18,11,13,14,14)
mean(ages)
[1] 14.22222
hist(ages)
[Figure: Histogram of ages; x-axis: ages, y-axis: Frequency]
EXPLORING AND VISUALIZING DATA
Objectives
▶ To identify ways to characterize data before doing serious
analysis
▶ To understand which measure is appropriate for a given variable
▶ To error-check the data
Histogram
[Figure: histogram of Age; x-axis: Age, y-axis: Frequency]
library(sm)
sm.density(patients$Age, model = "Normal", xlab = "Age")
[Figure: estimated density of Age; x-axis: Age, y-axis: Probability density function]
addmargins(table(patients$Sex))
F M Sum
30 20 50
To view the counts as percentages:
addmargins(prop.table(table(patients$Sex))*100)
F M Sum
60 40 100
3.1 Characterizing Data
A contingency table, sometimes called a two-way frequency
table, is a table with at least two rows and two columns that is
used in statistics to present categorical data as frequency
counts.
library(kableExtra)
cont_table <- table(patients[,c("Sex","Blood.Type")])
kable(addmargins(cont_table),
caption= "Contingency Table of Sex and Blood Type")
Table 1
Contingency Table of Sex and Blood Type
Sex     A   AB    B    O   Sum
F       6    7    6   11    30
M       7    3    0   10    20
Sum    13   10    6   21    50
Central Tendency
1. The Mean: The arithmetic average of the values
mean(patients$Age)
[1] 41.78
Sex Age
1 F 40.23333
2 M 44.10000
Central Tendency
2. The Median: The middle value of the ordered data
median(patients$Age)
[1] 40
Sex Age
1 F 37
2 M 41
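The slides show the mean and median of Age by Sex without the code that produced them. One way to obtain output of that shape is aggregate() with a formula; this is a sketch under that assumption, not necessarily the call used in the slides. The toy data frame below is hypothetical and is not the patients data.

```r
# Hypothetical toy data; NOT the patients data used in the slides.
toy <- data.frame(
  Sex = c("F", "F", "F", "M", "M", "M"),
  Age = c(30, 40, 50, 20, 45, 70)
)

# One row per group, with the chosen statistic of Age in each group.
mean_by_sex   <- aggregate(Age ~ Sex, data = toy, FUN = mean)
median_by_sex <- aggregate(Age ~ Sex, data = toy, FUN = median)

mean_by_sex
median_by_sex
```

Swapping the FUN argument is all that changes between the two summaries.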
Central Tendency
3. The Mode: The most frequently occurring value
[1] 41
Sex Age
1 F 32
2 M 41
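Base R has no function that returns the statistical mode (the built-in mode() reports an object's storage type instead). A minimal sketch of a helper follows; stat_mode is a hypothetical name introduced here, and it may differ from whatever was used to produce the output above. It returns every value tied for the highest count.

```r
# Hypothetical helper: most frequently occurring value(s) of a vector.
stat_mode <- function(v) {
  counts <- table(v)                            # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]   # value(s) with the highest count
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(top) else top
}

ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)
stat_mode(ages)                      # 14 (occurs three times)
stat_mode(c("A", "O", "O", "B"))     # "O"
```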
Appropriateness
▶ The mean is an appropriate measure when the values are closely
clustered, with no extreme values
▶ The median is an appropriate measure for a variable with extreme
values
▶ The mode is an appropriate measure for a categorical/qualitative
variable
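The guidance on mean versus median can be checked with a small sketch. Both vectors below are hypothetical values chosen for illustration.

```r
# Hypothetical values: closely clustered, then with one extreme value added.
clustered    <- c(12, 13, 14, 14, 15)
with_extreme <- c(clustered, 90)

mean(clustered)        # 13.6
median(clustered)      # 14
mean(with_extreme)     # about 26.33: pulled toward the extreme value
median(with_extreme)   # 14: unchanged by the extreme value
```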
Moments
In addition to measures of central tendency, “moments” are
important ways to characterize the shape of the distribution of a
sample variable.
Moments are applicable when the data are measured at the interval
level or higher. The first four moments (mean, variance, skewness,
and kurtosis) are the ones most often used.
Moments
Most often, kurtosis is measured against the normal distribution,
that is, as excess kurtosis (kurtosis minus 3), which is 0 for a
normal distribution.
▶ If the excess kurtosis is close to 0, then a normal distribution
is often assumed. These are called mesokurtic distributions.
▶ If the excess kurtosis is less than zero, then the distribution has
lighter tails and is called a platykurtic distribution.
▶ If the excess kurtosis is greater than zero, then the distribution
has heavier tails and is called a leptokurtic distribution.
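A minimal sketch of the third and fourth moments in base R. The function names skewness and excess_kurtosis are defined here for illustration and use the simple moment ratios (dividing by n). Add-on packages may apply small-sample corrections, so their results can differ slightly.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3,
# so that a normal distribution scores 0.
excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

x <- c(1, 2, 3, 4, 5)   # hypothetical, perfectly symmetric values
skewness(x)             # 0
excess_kurtosis(x)      # -1.3, i.e. platykurtic (lighter tails)
```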
Table 2
Descriptive Statistics
Statistic           Age   Systolic.BP..mm.Hg.
nbr.val           50.00                 50.00
nbr.null           0.00                  0.00
nbr.na             0.00                  0.00
min               19.00                105.00
max               72.00                163.00
range             53.00                 58.00
sum             2089.00               6546.00
median            40.00                129.50
mean              41.78                130.92
SE.mean            1.72                  2.07
CI.mean.0.95       3.46                  4.16
var              147.97                214.28
std.dev           12.16                 14.64
coef.var           0.29                  0.11
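The derived rows of Table 2 for Age can be reproduced from n, the mean, and the variance. This sketch assumes that CI.mean.0.95 is the half-width of a t-based 95% confidence interval for the mean, an assumption that matches the tabulated value. The object names are chosen here for illustration.

```r
# Values for Age taken from Table 2.
n        <- 50
age_mean <- 41.78
age_var  <- 147.97

age_sd   <- sqrt(age_var)                        # std.dev
se_mean  <- age_sd / sqrt(n)                     # SE.mean
ci_half  <- qt(0.975, df = n - 1) * se_mean      # assumed definition of CI.mean.0.95
coef_var <- age_sd / age_mean                    # coef.var

round(c(std.dev = age_sd, SE.mean = se_mean,
        CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
# std.dev 12.16, SE.mean 1.72, CI.mean.0.95 3.46, coef.var 0.29
```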
Apart from central tendency and moments, probability
distributions can also be characterized by order statistics. Order
statistics are based on the position of a value in an ordered list.
Typically, the list is ordered from low values to high values.
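As a small sketch of the idea, the ages vector from earlier in these slides can be ordered and indexed by position. The name ordered is chosen here for illustration.

```r
ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)

ordered <- sort(ages)       # 11 12 13 14 14 14 15 17 18 (low to high)
ordered[1]                  # first order statistic: the minimum, 11
ordered[length(ordered)]    # last order statistic: the maximum, 18
ordered[5]                  # middle position for n = 9: the median, 14
```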
qqnorm(patients$Age); qqline(patients$Age)
[Figure: normal Q-Q plot of Age; x-axis: Theoretical Quantiles]
The summary function produces the minimum value, the first
quartile, median, mean, third quartile, and max value. The generic
function quantile produces sample quantiles corresponding to the
given probabilities.
summary(patients$Age)
quantile(patients$Age) #default
25% 75%
34.0 47.5
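A minimal sketch of both functions, using the ages vector from earlier in these slides rather than the patients data. The probs argument of quantile() selects which quantiles are returned; the name q is chosen here for illustration.

```r
ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)

summary(ages)                 # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
quantile(ages)                # default probabilities: 0%, 25%, 50%, 75%, 100%
q <- quantile(ages, probs = c(0.25, 0.75))
q                             # 25% = 13, 75% = 15
```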
A boxplot is a standardized way of displaying the distribution of
data based on a five-number summary (minimum, first quartile,
median, third quartile, and maximum).
[Figure: horizontal boxplot of Age; axis from 20 to 70]
There are two data points that differ markedly from the other
observations.
An outlier is a data point that differs significantly from other
observations. It lies below the lower fence (lf) or above the upper
fence (uf).
lf=quantile(patients$Age, 0.25)-1.5*IQR(patients$Age)
uf=quantile(patients$Age, 0.75)+1.5*IQR(patients$Age)
c(lf,uf)
25% 75%
13.75 67.75
[1] 72 70
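The slides show the flagged values without the command that selected them. One way to select points outside the fences is logical indexing, sketched below on hypothetical values (not the patients data). The names q1, q3, and outliers are chosen here for illustration.

```r
# Hypothetical values; NOT the patients data used in the slides.
x <- c(19, 30, 34, 38, 40, 42, 47, 50, 55, 95)

q1 <- unname(quantile(x, 0.25))   # first quartile: 35
q3 <- unname(quantile(x, 0.75))   # third quartile: 49.25
lf <- q1 - 1.5 * IQR(x)           # lower fence: 13.625
uf <- q3 + 1.5 * IQR(x)           # upper fence: 70.625

# Keep the points that fall below the lower fence OR above the upper fence.
outliers <- x[x < lf | x > uf]
outliers                          # 95
```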
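x = patients$Age
# Horizontal boxplot, with the axis widened to leave room for the fences
boxplot(x, ylim = c(13, 75), horizontal = TRUE)
# Label the three quartiles above the box
text(34, 1.2, "Q1", pos = 3)
text(40, 1.2, "Q2", pos = 3)
text(47.50, 1.2, "Q3", pos = 3)
# Label the two outliers
text(70, 1, "70", pos = 1)
text(72, 1, "72", pos = 3)
# Label and mark the lower and upper fences
text(13.75, 1, "LF", pos = 3)
text(67.75, 1, "UF", pos = 3)
points(c(13.75, 67.75), c(1, 1), pch = c(25, 24), col = "red")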
[Figure: annotated horizontal boxplot of Age marking Q1, Q2, Q3, the lower fence (LF), the upper fence (UF), and the outliers 70 and 72; axis from 20 to 70]
dim(wages)
[1] 478 8
Activity 3.0
Work with a random sample of only 50 observations from the data,
drawn using the code below. In set.seed(), use the last 3 digits
of your ID no.
set.seed(143)
wage=wages[sample(nrow(wages), 50,
replace = FALSE, prob = NULL),]
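The seed makes the draw reproducible: the same seed selects the same rows each time. A minimal sketch on a hypothetical stand-in data frame (demo is not the wages data; only its 478-row size mirrors the dim(wages) output above):

```r
# Hypothetical stand-in with 478 rows; NOT the wages data.
demo <- data.frame(id = 1:478, value = seq(10, by = 0.5, length.out = 478))

set.seed(143)
s1 <- demo[sample(nrow(demo), 50, replace = FALSE), ]

set.seed(143)                      # same seed again
s2 <- demo[sample(nrow(demo), 50, replace = FALSE), ]

identical(s1, s2)                  # TRUE: same seed, same 50 rows
nrow(s1)                           # 50
anyDuplicated(s1$id) == 0          # TRUE: no row drawn twice without replacement
```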