Statistical Computing
Erik B. Erhardt
Fall 2015
Contents
1 R plotting
Chapter 1
R plotting
Edward Tufte
Presenting data and
information
Tufte on Graphical Excellence
(VDQI p. 13)
Excellence in statistical graphics consists of complex ideas
communicated with clarity, precision, and efficiency. Graphical displays
should induce the viewer to think about the substance rather than about
methodology, graphic design, the technology of graphic production,
or something else.
Graphics reveal data. Indeed graphics can be more precise and revealing
than conventional statistical computations. Consider Anscombe’s
quartet [1]: all four of these data sets are described by exactly the same
linear model (at least until the residuals are examined).
[1] FJ Anscombe, “Graphs in Statistical Analysis,” American Statistician, 27 (February 1973), 17-21.
# read data in wide format from space-delimited text
# the text= argument lets read.table() read from a character string
anscombe <- read.table(text = "
X Y X Y X Y X Y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
", header=TRUE)
#anscombe
# reformat the data into long format
# odd columns hold x, even columns hold y; g labels the data set (1 to 4)
anscombe.long <- data.frame(
    x = c(anscombe[, 1], anscombe[, 3]
        , anscombe[, 5], anscombe[, 7])
  , y = c(anscombe[, 2], anscombe[, 4]
        , anscombe[, 6], anscombe[, 8])
  , g = rep(1:4, each = nrow(anscombe))
  )
head(anscombe.long, 2)
## x y g
## 1 10 8.04 1
## 2 8 6.95 1
tail(anscombe.long, 2)
## x y g
## 43 8 7.91 4
## 44 8 6.89 4
# function to calculate selected numerical summaries
anscombe.sum <- function(df) {
results <- as.list(new.env()); # create a list to return with data
                         1       2       3       4
n                    11.00   11.00   11.00   11.00
x.mean                9.00    9.00    9.00    9.00
y.mean                7.50    7.50    7.50    7.50
eq.reg.(Intercept)    3.00    3.00    3.00    3.00
eq.reg.x              0.50    0.50    0.50    0.50
b1.se                 0.12    0.12    0.12    0.12
b1.t                  4.24    4.24    4.24    4.24
x.SS                110.00  110.00  110.00  110.00
ResSS                13.76   13.78   13.76   13.74
RegSS                27.51   27.50   27.47   27.49
xy.cor                0.82    0.82    0.82    0.82
xy.r2                 0.67    0.67    0.67    0.67
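A few of the summaries in this table can be reproduced with a short sketch. This is not the anscombe.sum() function from the notes, whose body is not shown here. It assumes R's built-in anscombe data frame (columns x1 to x4 and y1 to y4), and the helper name summarize_pair is hypothetical.

```r
# Sketch only: uses the built-in 'anscombe' data, not anscombe.sum() from the notes
quartet <- datasets::anscombe

# hypothetical helper: selected summaries for one (x, y) data set
summarize_pair <- function(x, y) {
  fit <- lm(y ~ x)                       # simple linear regression of y on x
  c(n         = length(x),
    x.mean    = mean(x),
    y.mean    = mean(y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    xy.cor    = cor(x, y),
    xy.r2     = summary(fit)$r.squared)
}

# one column of summaries per data set
quartet_summary <- sapply(1:4, function(i) {
  summarize_pair(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]])
})
colnames(quartet_summary) <- 1:4
print(round(quartet_summary, 2))
```

Rounded to two decimals, the four columns should agree with each other, which is the point of the quartet: the numerical summaries cannot tell the data sets apart.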
However...
[Figure: Anscombe's quartet. Four panels (data sets 1 to 4) of y versus x.]
Minard
One of the best
The narrative graphic of space and time par excellence is perhaps the
following plot by Charles Joseph Minard (1781–1870), the French
engineer, which shows the terrible fate of Napoleon’s army in Russia. This
combination of data map and time-series, drawn in 1869, portrays a
sequence of devastating losses suffered in Napoleon’s Russian campaign of
1812.
http://www.danvk.org/wp/2009-12-04/a-new-view-on-minards-napoleon/
2. Show causality
8. Don’t dequantify
[Figure: response variable versus condition (1, 2, 3), shown in three panels.]
EA Allen, EB Erhardt, and VD Calhoun. Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron,
74:603–608, 2012.
Napoleon was defeated by the winter, not the opposing army, as shown
by the temperature scale on the bottom of Minard’s graph.
Red spots indicate water pumps. Lines indicate the death count at each location.
5. temperatures
6. dates
Don’t let the accidents of the modes of production break up the text,
images, and data.
[Figure: average IC potential (µV) versus time (ms), two panels; * = p < 0.001.]
EA Allen, EB Erhardt, and VD Calhoun. Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron,
74:603–608, 2012.
[Figure: maps of ∆β weight (Novel – Standard) at slices Z = 22, 2, −19, and −39, with |t| scale, and Novel β (%∆BOLD/stim) for the test of H0: µN = µS versus Ha: µN ≠ µS.]
EA Allen, EB Erhardt, and VD Calhoun. Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron,
74:603–608, 2012.
[Figure: observations for Team A, Follow-up − Baseline, for BDI-II, BHS, CORE-OM, and CORE-OM-R, each with the mean and a one-sided CI. A score decrease indicates improvement.]
CR Koons, B O’Rourke, B Carter, EB Erhardt. Negotiating for improved reimbursement for Dialectical Behavior Therapy: A
ian, not Latin, because he wanted to reach a wider audience than the
scientific elite.
9 Galileo Galilei, History and Demonstrations Concerning Sunspots and Their Phenomena (Rome, 1613), translated by Stillman Drake, Discoveries and Opinions of Galileo (Garden City, New York, 1957), pp. 115-116.
[Figure: two panels with a latitude axis (30°N, equator, 30°S) and scale marks at 1.0% and 0.1%.]
They have an inherent credibility with the viewer because they show
a lot of data – “I know what I’m talking about and I’m showing all my
data to you.”
[Figure: Anscombe's quartet. Four panels (data sets 1 to 4) of y versus x.]
Start by asking, what is the intellectual task that this display is
supposed to help with?
Examples of “Bad”
are easy to find
Beautiful, informative
plots in R
Introduction to the
ggplot2 package.
Plotting with ggplot2
Beautiful plots made simple
# only needed once after installing or upgrading R
install.packages("ggplot2")
# each time you start R
# load ggplot2 functions and datasets
library(ggplot2)
ggplot()
# specify the dataset and variables
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point() # add a plot layer with points
print(p)
[Figure: scatterplot of hwy versus displ.]
Additional variables
Aesthetics and faceting
docs.ggplot2.org/current/geom_point.html
Aesthetics
The legend is chosen and displayed automatically
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class))
print(p)
[Figure: scatterplot of hwy versus displ with points coloured by class; the legend lists 2seater, compact, midsize, minivan, pickup, subcompact, suv.]
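The automatic legend can be understood by contrasting a mapped aesthetic with a fixed one. The sketch below is an illustration rather than part of the original notes; it assumes ggplot2 is installed and uses its mpg data, and the object names p_mapped and p_fixed are made up for the example.

```r
library(ggplot2)

# mapped: colour is inside aes(), so it varies with class and a legend is drawn
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy))
p_mapped <- p_mapped + geom_point(aes(colour = class))

# fixed: colour is set outside aes(), so every point is blue and no legend is needed
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy))
p_fixed <- p_fixed + geom_point(colour = "blue")

print(p_mapped)
print(p_fixed)
```

A legend appears only when an aesthetic is mapped to a variable, because only then is there a correspondence between data values and colours to explain.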
Aesthetics
Behavior
Aesthetic   Discrete                   Continuous
colour      Rainbow of colors          Gradient from red to blue
size        Discrete size steps        Linear mapping between radius and value
shape       Different shape for each   Shouldn’t work
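The discrete and continuous cases for colour can be compared directly. This is a minimal sketch, assuming ggplot2 and its mpg data, where drv is a character variable (discrete) and cty is numeric (continuous). The exact default palettes depend on the installed ggplot2 version, so the colours on screen may differ from the description in the table.

```r
library(ggplot2)

# discrete variable: one distinct colour per value of drv (4, f, r)
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy))
p_discrete <- p_discrete + geom_point(aes(colour = drv))

# continuous variable: colours are taken from a gradient over the range of cty
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy))
p_continuous <- p_continuous + geom_point(aes(colour = cty))

print(p_discrete)
print(p_continuous)
```

The same aes(colour = ...) call produces a categorical legend in the first plot and a colour bar in the second; ggplot2 chooses the scale from the type of the variable.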
[Figure: two scatterplots of hwy versus displ with legends for drv (4, f, r), cyl (4 to 8), and class.]
Faceting
[Figures: hwy versus displ, faceted four ways: in columns by cyl (4, 5, 6, 8); in rows by drv (4, f, r); in a grid with drv in rows and cyl in columns; and in separate panels (including one labelled suv).]
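The code for the faceted plots is not shown in these notes. The sketch below is one way to produce panels like those described, inferred from the panel labels in the figures; the choice of facet_grid() and facet_wrap() and the faceting variables are assumptions, not the author's confirmed code.

```r
library(ggplot2)

# base scatterplot shared by all faceted versions
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point()

p_cols <- p + facet_grid(. ~ cyl)    # one column of panels per cyl value
p_rows <- p + facet_grid(drv ~ .)    # one row of panels per drv value
p_grid <- p + facet_grid(drv ~ cyl)  # drv in rows, cyl in columns
p_wrap <- p + facet_wrap(~ class)    # one panel per class, wrapped into rows

print(p_cols)
print(p_rows)
print(p_grid)
print(p_wrap)
```

facet_grid() lays panels out on a strict row-by-column grid, which suits two crossed variables; facet_wrap() wraps a single sequence of panels, which suits one variable with many levels.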
Improving plots
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point()
print(p)
[Figure: scatterplot of hwy versus cty.]
jitter
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = 'jitter')
print(p)
[Figure: jittered scatterplot of hwy versus cty.]
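Jittering helps here because cty and hwy are whole numbers, so many cars fall on exactly the same point. The sketch below counts those overlaps and then controls the amount of jitter explicitly. It is an illustration, not the author's code: the width and height of 0.3 and the seed are arbitrary choices, and the seed argument assumes a reasonably recent ggplot2.

```r
library(ggplot2)

# number of cars whose (cty, hwy) pair duplicates an earlier car
n_overlapping <- sum(duplicated(mpg[, c("cty", "hwy")]))
print(n_overlapping)

# move each point by at most 0.3 units horizontally and vertically;
# the seed makes the random displacement repeatable
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))
print(p)
```

Keeping the jitter below half a unit means each displaced point stays closer to its true integer value than to any neighbouring value.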
reorder
reordering the class variable by the mean hwy
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point()
print(p)
[Figure: hwy versus class, with class reordered by mean hwy.]
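What reorder() does to the class variable can be seen without drawing anything. This is a small sketch using base R's reorder() on the mpg data from ggplot2; the object names class_by_mean and class_by_median are made up for the example.

```r
library(ggplot2)  # loaded here only for the mpg data

# reorder() returns a factor whose levels are sorted by a summary of a
# second variable; the default summary is the mean
class_by_mean   <- reorder(mpg$class, mpg$hwy)
class_by_median <- reorder(mpg$class, mpg$hwy, FUN = median)

print(levels(class_by_mean))
print(levels(class_by_median))

# the group means themselves, in increasing order
print(sort(tapply(mpg$hwy, mpg$class, mean)))
```

Because ggplot2 places the levels of a factor along a discrete axis in level order, reordering the levels is all that is needed to sort the categories on the plot.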
reorder
and jitter
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point(position = 'jitter')
print(p)
[Figure: jittered hwy versus class, with class reordered by mean hwy.]
reorder
[Figure: jittered hwy versus reorder(class, hwy); the classes appear in the order pickup, suv, minivan, 2seater, midsize, subcompact, compact.]
reorder
and boxplot
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_boxplot()
print(p)
[Figure: boxplots of hwy by class, with class reordered by mean hwy.]
reorder
[Figure: hwy versus reordered class.]
reorder by median
[Figure: hwy versus class, with class reordered by median hwy.]
reorder by median
and boxplot and jitter (switched order)
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = .1))
print(p)
[Figure: boxplots of hwy by class (reordered by median hwy) with jittered points drawn on top.]
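Layers are drawn in the order they are added, so the boxplots go first and the points land on top. The variation below is a sketch, not the author's code. It makes two assumptions beyond the original: outlier.shape = NA hides the boxplot's own outlier points so that they are not drawn twice, and height = 0 keeps the jitter horizontal so that the hwy values stay exact.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
# layer 1: semi-transparent boxplots, without their own outlier points
p <- p + geom_boxplot(alpha = 0.5, outlier.shape = NA)
# layer 2: every observation, jittered horizontally only, drawn over the boxes
p <- p + geom_jitter(position = position_jitter(width = 0.1, height = 0))
p <- p + labs(x = "class (ordered by median hwy)", y = "hwy")
print(p)
```

Swapping the two geom lines would draw the boxes over the points, and with alpha = 0.5 the points would then show through only faintly.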