3 Exploring and Visualizing Data

This document introduces RStudio and provides an overview of basic data management and visualization techniques in R. It discusses setting the working directory in RStudio, writing comments in code, installing and loading packages, and creating vectors. It then covers exploring and visualizing data, including characterizing data through measures, ranges, missing values, frequency tables, histograms, density curves, and contingency tables. The goal is to help users understand how to appropriately characterize data before analysis and check for errors.
POS 2108 - Quantitative Research Methods for Political Science
Exploring and Visualizing Data

Elmer Poliquit

Quantitative Research Methods for Political Science - ESPoliquit


Introduction to RStudio
What is R?
R is an open-source software environment for statistical computing
and graphics.
It has a worldwide repository system called the Comprehensive R
Archive Network (CRAN), http://cran.r-project.org, from which the
software and user-contributed add-on packages can be downloaded.
▶ The R language is case sensitive.

What is RStudio?
RStudio is an Integrated Development Environment (IDE) that
facilitates issuing of commands in R interactively.
It has many convenient and easy-to-use tools for coding, plot
management, object browsing, package management, among other
things.
To download RStudio: https://www.rstudio.com/

The RStudio 4-Panel Component Layout

1. Upper left: Script window for editing R script files.


2. Lower left: Console window for directly interacting with an R
process.
3. Upper right: Workspace and history browser
4. Lower right: File browser, plot browser, package management,
help browser.

The RStudio Script Window


▶ Scripts. R code is saved as a script file with an .R
extension.
▶ To create a new script, press the shortcut Ctrl+Shift+N or go
to File, New File, then R Script.

Basic Data Management and Functions in RStudio

The RStudio Script Window


▶ Working directories. To set a working directory that contains
the datasets to be used in the program, use the setwd()
command.

#Set the working directory


setwd("D:/CANVAS CLASSES/POS 2108 Quantitative Methods/data

▶ When copy-pasting a directory path, make sure to change each
backslash (\) to a forward slash (/).
▶ Comments. These make source code easier for us to
understand, and are generally ignored by the computer.
▶ Comments are preceded by the # symbol.
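
As a small illustration of the backslash rule above, the sketch below converts a Windows-style path into the form R expects. The path used here is a hypothetical example, not a folder from this course, and the setwd() call is left commented out because that folder may not exist on your machine.

```r
# Hypothetical Windows path; inside an R string each backslash is typed as \\
win_path <- "C:\\Users\\student\\data"

# Replace every backslash with a forward slash
r_path <- chartr("\\", "/", win_path)
print(r_path)

# The converted path could then be passed to setwd():
# setwd(r_path)

# getwd() reports the current working directory
getwd()
```
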

Installing Packages
▶ The packages tab in the lower right panel can be used to
install and load different packages.
▶ Loaded packages are marked with a check symbol.
▶ Alternatively, packages can be installed and loaded using the
install.packages("") and library() commands in the script
window, respectively (the package name goes between the quotes).

library(MASS)
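
One possible pattern, shown here only as a sketch, is to install a package when it is not yet available and then load it. MASS is used because it appears on this slide; the variable name pkg is just an illustrative choice.

```r
# Name of the package to use (MASS is the example from the slide)
pkg <- "MASS"

# Install only if the package is not already available on this machine
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package so that its functions and datasets can be used
library(MASS)
```

Installation is needed only once per machine, while library() has to be called in every new R session.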

Other Remarks
▶ More syntax will be added as we progress with the learning.
▶ R syntax is case-sensitive (e.g., the variables x and X are
different). Thus, creating objects with the same name but
different case is highly discouraged.
▶ It is likely that a process can be coded and done in a number
of ways in R.
▶ Assignment operator. The value of an R expression on the right
can be assigned to an object on its left using the assignment
operator <- or =.

x<-c(1,2,3)
x

[1] 1 2 3

y=c(1,2,3)
y

[1] 1 2 3
Vector

ages=c(12,15,14,17,18,11,13,14,14)
mean(ages)

[1] 14.22222

hist(ages)

[Figure: Histogram of ages; x-axis: ages, y-axis: Frequency]
EXPLORING AND VISUALIZING DATA

Objectives
▶ To identify the ways to characterize data before doing serious
analysis
▶ To understand which measure is appropriate for the data
▶ To check the data for errors

3.1 Characterizing Data

1. What does it mean to characterize your data?

▶ Know how many observations are contained in your data


▶ Know the distribution of those observations over the range of
your variable(s)

2. What kinds of measures (ratio, interval, ordinal, nominal) do
you have?
3. What are the ranges of valid measures for each variable?
4. How many missing (no data) or miscoded (measures that fall
outside the valid range) cases do you have? (Data cleaning)
5. What do the coded values represent?
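
The questions above can be turned into a few quick checks. The following is a minimal sketch on a small made-up data frame; the object name survey, its values, and the assumed valid age range of 0 to 120 are all hypothetical.

```r
# Hypothetical data with one missing age, one missing sex, and one miscoded age
survey <- data.frame(
  age = c(25, 31, NA, 47, 130, 52),
  sex = c("F", "M", "F", NA, "M", "F")
)

# How many observations are there?
n_obs <- nrow(survey)

# How many missing values does each variable have?
n_missing <- colSums(is.na(survey))

# Which rows hold ages outside the assumed valid range (0 to 120)?
valid_age <- c(0, 120)
miscoded <- which(survey$age < valid_age[1] | survey$age > valid_age[2])

# What kind of variable is each column?
str(survey)

n_obs
n_missing
miscoded
```
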

Data: Number of Registered Filipino Emigrants by Place of Origin
in the Philippines: 1988-2020, Commission on Filipinos Overseas
Source: https://data.gov.ph/index/public/dataset/Commission%20of%20Filipino%20Overseas:%20Statistical%20Profile%20Of%20Registered%20Filipino%20Emigrants%20%5B1981-2020%5D/q8o5akrp-ers2-7hag-m3ds-h7eki6wqhqs9

setwd("D:/CANVAS CLASSES/POS 2108 Quantitative Methods/data


patients=read.csv("patients.csv")
head(patients, 4)

  Sex Blood.Type Age Systolic.BP..mm.Hg.
1   F          A  38                 125
2   F          B  27                 130
3   F          O  32                 120
4   M         AB  55                 126
A histogram creates intervals of equal length, called bins, and
displays the frequency of observations in each of the bins.

hist(patients$Age, main = "Histogram", xlab = "Age")

[Figure: Histogram; x-axis: Age, y-axis: Frequency]
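
To see the bins and their counts directly, hist() can be asked to return them without drawing. The sketch below reuses the ages vector from the earlier slide; the bin width of 2 is an arbitrary choice made for illustration.

```r
# The ages vector from the earlier slide
ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)

# Equal-width bins from 10 to 20; plot = FALSE returns the bins without drawing
h <- hist(ages, breaks = seq(10, 20, by = 2), plot = FALSE)

h$breaks  # bin boundaries
h$counts  # number of observations in each bin
```

By default each bin includes its right endpoint, so an age of 14 is counted in the bin from 12 to 14 and not in the bin from 14 to 16.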

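A density curve is a smooth curve that shows how probability is
distributed over the values of a variable. The area under the curve
over an interval gives the probability of falling in that interval,
and the total area under the curve is 1.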

library(sm)
sm.density(patients$Age, model = "Normal", xlab = "Age")
[Figure: density curve of Age; x-axis: Age, y-axis: Probability density function]
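
If the sm package is not available, a density curve can also be drawn with the base R function density(). This is an alternative to the sm.density() call on the slide, not the same method, and it is sketched here with the small ages vector from the earlier slide.

```r
# The ages vector from the earlier slide
ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)

# Kernel density estimate with the default settings
d <- density(ages)
plot(d, main = "Density of ages", xlab = "ages")

# Approximate the area under the curve with the trapezoid rule;
# it should be close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```
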

A table of the frequency counts for each level of a categorical
variable is called a frequency table.

addmargins(table(patients$Sex))

F M Sum
30 20 50

To view the frequencies as percentages:

addmargins(prop.table(table(patients$Sex))*100)

F M Sum
60 40 100
A contingency table, sometimes called a two-way frequency
table, is a tabular mechanism with at least two rows and two
columns used in statistics to present categorical data in terms of
frequency counts.

library(kableExtra)
cont_table <- table(patients[,c("Sex","Blood.Type")])
kable(addmargins(cont_table),
caption= "Contingency Table of Sex and Blood Type")

Table 1
Contingency Table of Sex and Blood Type

A AB B O Sum
F 6 7 6 11 30
M 7 3 0 10 20
Sum 13 10 6 21 50
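
A contingency table can also be shown as percentages within each row. The sketch below rebuilds the table from the counts printed above, so it runs without the patients file, and then applies prop.table() with margin = 1; the name row_pct is an illustrative choice.

```r
# Counts copied from the contingency table above
cont_table <- matrix(
  c(6, 7, 6, 11,
    7, 3, 0, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("F", "M"), Blood.Type = c("A", "AB", "B", "O"))
)

# Percentages within each row (margin = 1), rounded to one decimal place
row_pct <- round(prop.table(cont_table, margin = 1) * 100, 1)
row_pct
```

Using margin = 2 instead would give percentages within each column.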
Central Tendency

1. The Mean: The arithmetic average of the values

mean(patients$Age)

[1] 41.78

Mean of Age per Sex

aggregate(Age~Sex, data= patients, mean)

Sex Age
1 F 40.23333
2 M 44.10000

2. The Median: The value at the center of the distribution

median(patients$Age)

[1] 40

Median of Age per Sex

aggregate(Age~Sex, data= patients, median)

Sex Age
1 F 37
2 M 41
3. The Mode: The most frequently occurring value

getmode <- function(v) {


uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(patients$Age)

[1] 41

Mode of Age per Sex

aggregate(Age~Sex, data= patients, getmode)

Sex Age
1 F 32
2 M 41

Level of Measurement and Central Tendency


▶ Arithmetic operations cannot be meaningfully performed on
ordinal or nominal level measures
▶ Data must be measured at the interval level (or higher) to
calculate a meaningful mean

Appropriateness
▶ The mean is an appropriate measure when the values are close
together, with no extreme values
▶ The median is an appropriate measure for a variable with
extreme values
▶ The mode is an appropriate measure for a
categorical/qualitative variable
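
A small made-up example may help show why the median is preferred when extreme values are present. The two vectors below are hypothetical and differ only in their last value.

```r
# Hypothetical values: the second vector has one extreme value
wages_typical <- c(10, 11, 12, 13, 14)
wages_extreme <- c(10, 11, 12, 13, 100)

mean(wages_typical)    # 12
median(wages_typical)  # 12

mean(wages_extreme)    # 29.2, pulled upward by the extreme value
median(wages_extreme)  # 12, unchanged
```
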


Moments
In addition to measures of central tendency, “moments” are
important ways to characterize the shape of the distribution of a
sample variable.
Moments are applicable when the data measured is interval type
(the level of measurement). The first four moments are those that
are most often used.

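1. The expected value of a variable is the value you would
obtain if you could multiply each possible value within a
population by its probability of occurrence and sum the results.
For a sample in which each of the n observations is weighted
equally, it is estimated by the sample mean:

$$E(X) = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$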

2. The variance of a variable is a measure that illustrates how a
variable is spread, or distributed, around its mean. The
numerator of the formula is known as the Total Sum of Squares
(TSS_x).

$$s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$

Taking the square root of both sides of the equation gives
the standard deviation.
3. Skewness is a measure of the asymmetry of a distribution.
Skewness is calculated by dividing the third moment by the
cube of the standard deviation.

$$S = \frac{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n-1}}{\left(\sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}\right)^{3}}$$
The rule of thumb seems to be:
▶ If the skewness is between -0.5 and 0.5, the data are fairly
symmetrical
▶ If the skewness is between -1 and -0.5 or between 0.5 and 1,
the data are moderately skewed
▶ If the skewness is less than -1 or greater than 1, the data are
highly skewed
4. The kurtosis of a distribution refers to the peak of a variable
(i.e., the mode) and the relative frequency of observations in
the tails. Kurtosis is calculated by dividing the fourth moment
by the square of the second moment (the variance).

$$K = \frac{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n-1}}{\left(\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}\right)^{2}}$$
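Most often, kurtosis is measured against the normal distribution.
A normal distribution has a kurtosis of 3 when kurtosis is computed
as the fourth moment divided by the squared variance, so the
comparisons below refer to excess kurtosis (kurtosis minus 3).
▶ If the excess kurtosis is close to 0, then a normal distribution
is often assumed. These are called mesokurtic distributions.
▶ If the excess kurtosis is less than zero, then the distribution
has lighter tails and is called a platykurtic distribution.
▶ If the excess kurtosis is greater than zero, then the distribution
has heavier tails and is called a leptokurtic distribution.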
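
The moments described in this section can be computed step by step in base R. The following is a sketch that follows the definitions given here, with n - 1 in the denominators and kurtosis taken as the fourth moment divided by the squared variance. It uses the ages vector from the earlier slide, and other software may use slightly different denominators, so results can differ a little.

```r
# The ages vector from the earlier slide
ages <- c(12, 15, 14, 17, 18, 11, 13, 14, 14)

n   <- length(ages)
dev <- ages - mean(ages)        # deviations from the mean

s2 <- sum(dev^2) / (n - 1)      # variance (second moment)
s  <- sqrt(s2)                  # standard deviation

# Skewness: third moment divided by the cube of the standard deviation
skew <- (sum(dev^3) / (n - 1)) / s^3

# Kurtosis: fourth moment divided by the square of the variance
kurt <- (sum(dev^4) / (n - 1)) / s2^2

# Excess kurtosis, for comparison against the normal distribution
excess_kurt <- kurt - 3

c(variance = s2, sd = s, skewness = skew,
  kurtosis = kurt, excess = excess_kurt)
```

The variance and standard deviation computed this way match the built-in var() and sd() functions.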

library(pastecs)
ds=round(stat.desc(patients[3:4], basic=T,desc=T, norm=F),2)
kable(ds, caption="Descriptive Statistics")

Table 2
Descriptive Statistics

Age Systolic.BP..mm.Hg.
nbr.val 50.00 50.00
nbr.null 0.00 0.00
nbr.na 0.00 0.00
min 19.00 105.00
max 72.00 163.00
range 53.00 58.00
sum 2089.00 6546.00
median 40.00 129.50
mean 41.78 130.92
SE.mean 1.72 2.07
CI.mean.0.95 3.46 4.16
var 147.97 214.28
std.dev 12.16 14.64
coef.var 0.29 0.11
Apart from central tendency and moments, probability
distributions can also be characterized by order statistics. Order
statistics are based on the position of a value in an ordered list.
Typically, the list is ordered from low values to high values.

1. The median is the value at the center of the distribution,


therefore 50% of the observations in the distribution will have
values above the median and 50% will have values below.
2. The first quartile, Q1 , consists of observations whose values
are within the first 25% of the distribution. The values of the
second quartile, Q2 , are contained within the first half (50%)
of the distribution, and is marked by the distribution’s median.
The third quartile, Q3 , includes the first 75% of the
observations in the distribution.

The interquartile range (IQR) measures the spread of the ordered
values. It is calculated by subtracting Q1 from Q3. The IQR
contains the middle 50% of the distribution.
3. Percentiles list the data in hundredths. Another way to
compare a variable distribution to a theoretical distribution is
with a quantile-comparison plot (qq plot). A qq plot displays
the observed percentiles against those that would be expected
in a normal distribution.

qqnorm(patients$Age); qqline(patients$Age)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
The summary function produces the minimum value, the first
quartile, median, mean, third quartile, and max value. The generic
function quantile produces sample quantiles corresponding to the
given probabilities.
summary(patients$Age)

Min. 1st Qu. Median Mean 3rd Qu. Max.


19.00 34.00 40.00 41.78 47.50 72.00

quantile(patients$Age) #default

0% 25% 50% 75% 100%


19.0 34.0 40.0 47.5 72.0

quantile(patients$Age, c(0.25, 0.75)) #specific

25% 75%
34.0 47.5
A boxplot is a standardized way of displaying the distribution of
data based on a five-number summary (minimum, first quartile,
median, third quartile, and maximum).

boxplot(patients$Age, horizontal = T, col = "gold")

[Figure: horizontal boxplot of Age; axis from 20 to 70]

There are two data points that are different from the other
observations.
An outlier is a data point that differs significantly from other
observations. It is below the lower fence (lf) or above the upper
fence (uf).

lf = Q1 − 1.5 ∗ IQR and uf = Q3 + 1.5 ∗ IQR

lf=quantile(patients$Age, 0.25)-1.5*IQR(patients$Age)
uf=quantile(patients$Age, 0.75)+1.5*IQR(patients$Age)
c(lf,uf)

25% 75%
13.75 67.75

out <- boxplot.stats(patients$Age, do.out = T)


out$out

[1] 72 70
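
The fences can also be used directly to pick out the outliers. The sketch below does this on a small made-up vector so that it runs without the patients file; the values in x are hypothetical.

```r
# Hypothetical values with one unusually large observation
x <- c(21, 30, 34, 38, 40, 44, 47, 52, 95)

# Lower and upper fences from the quartiles and the IQR
lf <- quantile(x, 0.25) - 1.5 * IQR(x)
uf <- quantile(x, 0.75) + 1.5 * IQR(x)
c(lf, uf)

# Keep only the values below the lower fence or above the upper fence
outliers <- x[x < lf | x > uf]
outliers
```

Note that boxplot.stats() computes the quartiles as hinges, which can differ slightly from the quartiles returned by quantile() for some sample sizes, so the two approaches may occasionally flag different points.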
x = patients$Age
boxplot(x, ylim=c(13, 75), horizontal=T)
text(34, 1.2, "Q1", pos=3)
text(40, 1.2, "Q2", pos=3)
text(47.50, 1.2, "Q3", pos=3)
text(70, 1, "70", pos=1)
text(72, 1, "72", pos=3)
text(13.75, 1, "LF", pos=3)
text(67.75, 1, "UF", pos=3)
points(c(13.75,67.75), c(1,1), pch=c(25,24), col ="red" )
[Figure: horizontal boxplot of Age annotated with Q1, Q2, Q3, the lower and upper fences (LF, UF), and the two outliers, 70 and 72]
Activity 3.0
Using the wages data, do the following.

setwd("D:/CANVAS CLASSES/POS 2108 Quantitative Methods/data


wages=read.csv("wages.csv")
head(wages)

  ID Education     Race    Sex  Status Age ExpHigh  Wage
1  1        16    White   Male  Single  45       1 24.02
2  2        14    White Female  Single  45       1 14.40
3  3        13    White Female Married  45       1 11.43
4  4        16 Nonwhite   Male Married  45       1  7.00
5  5        12 Hispanic   Male Married  45       1  6.25
6  6        12    White   Male Married  45       1 12.50

dim(wages)

[1] 478 8
Draw a random sample of only 50 observations from the data using
the code below, with set.seed() set to the last 3 digits of your
ID number.

set.seed(143)
wage=wages[sample(nrow(wages), 50,
replace = FALSE, prob = NULL),]

A. Using the numerical variable (5 marks each)

1. Create a density curve for the wage. Describe the distribution.


2. Find the mean, median, variance, standard deviation, first
quartile, third quartile, and IQR for the wage.
3. Construct a boxplot, then describe the shape of the
distribution, and identify any outliers for the wage.
4. Construct a qq plot for the wage.
5. What race has the highest mean wage?

B. Using the categorical variable (5 marks each)


Choose your own two categorical variables.

1. Create a frequency contingency table.


2. Create a percentage contingency table.

C. Mixed variables (5 marks each)

1. Create boxplots for wages per status.


2. Compute the lower and upper fences for each status’ wage
and identify outliers if there are any.

Reference

▶ Jenkins-Smith, Hank C., Ripberger, Joseph T., Copeland,
Gary, Nowlin, Matthew C., Hughes, Tyler, Fister, Aaron L.,
Wehde, Wesley (2017). Quantitative Research Methods for
Political Science, Public Policy and Public Administration
(With Applications in R). Retrieved from
https://shareok.org/handle/11244/52244. DOI:
10.15763/11244/52244

THANK YOU!
