Lecture notes for


Statistical Computing 1 (SC1)
Stat 590
University of New Mexico

Erik B. Erhardt

Fall 2015
Contents
1 R plotting
Chapter 1

R plotting
Edward Tufte
Presenting data and information

Tufte on Graphical Excellence (VDQI p. 13)

Excellence in statistical graphics consists of complex ideas
communicated with clarity, precision, and efficiency. Graphical
displays should

• show the data
• induce the viewer to think about the substance rather than about
  methodology, graphic design, the technology of graphic production,
  or something else
• avoid distorting what the data have to say
• present many numbers in a small space
• make large data sets coherent
• encourage the eye to compare different pieces of data
• reveal the data at several levels of detail, from a broad overview to
  the fine structure
• serve a reasonably clear purpose: description, exploration,
  tabulation, or decoration
• be closely integrated with the statistical and verbal descriptions of
  a data set.

Graphics reveal data. Indeed graphics can be more precise and
revealing than conventional statistical computations. Consider Anscombe's
quartet¹: all four of these data sets are described by exactly the same
linear model (at least until the residuals are examined).
¹ FJ Anscombe, "Graphs in Statistical Analysis," American Statistician, 27 (February 1973), 17-21.

# read data in wide format from space-delimited text
# (the text= argument lets read.table() parse a character string directly)
anscombe <- read.table(text = "
X Y X Y X Y X Y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
", header = TRUE)
# the repeated X/Y names are made unique on reading; columns are used by position below
#anscombe
# reformat the data into long format
anscombe.long <- data.frame(
    x = c(anscombe[, 1], anscombe[, 3], anscombe[, 5], anscombe[, 7])
  , y = c(anscombe[, 2], anscombe[, 4], anscombe[, 6], anscombe[, 8])
  , g = rep(1:4, each = nrow(anscombe))   # data set label, 1 to 4
  )

head(anscombe.long, 2)
##    x    y g
## 1 10 8.04 1
## 2  8 6.95 1
tail(anscombe.long, 2)
##    x    y g
## 43 8 7.91 4
## 44 8 6.89 4
# function to calculate selected numerical summaries
anscombe.sum <- function(df) {
  results <- list()                                  # create a list to return with data
  results$n      <- length(df$x)                     # sample size
  results$x.mean <- mean(df$x)                       # mean of x
  results$y.mean <- mean(df$y)                       # mean of y
  lm.xy          <- lm(y ~ x, data = df)             # fit slr
  results$eq.reg <- lm.xy$coefficients               # regression coefficients
  results$b1.se  <- summary(lm.xy)$coefficients[2,2] # SE of slope
  results$b1.t   <- summary(lm.xy)$coefficients[2,3] # t-stat of slope
  results$x.SS   <- sum((df$x - results$x.mean)^2)   # x sum of squares
  results$ResSS  <- sum(lm.xy$residuals^2)           # residual SS of y
  results$RegSS  <- sum((df$y - results$y.mean)^2) - results$ResSS # reg SS
  results$xy.cor <- cor(df$x, df$y)                  # correlation
  results$xy.r2  <- summary(lm.xy)$r.squared         # R^2 for regression
  return(results)
}
# calculate and store summaries by data group g
results.temp <- by(anscombe.long, anscombe.long$g, anscombe.sum)
# make a table: one column of summaries per data set
x.table <- cbind(  t(t(unlist(results.temp[[1]])))
                 , t(t(unlist(results.temp[[2]])))
                 , t(t(unlist(results.temp[[3]])))
                 , t(t(unlist(results.temp[[4]])))
                 )
colnames(x.table) <- 1:4 # label the table columns

Those four datasets have many of the same numerical summaries.


                         1       2       3       4
n                    11.00   11.00   11.00   11.00
x.mean                9.00    9.00    9.00    9.00
y.mean                7.50    7.50    7.50    7.50
eq.reg.(Intercept)    3.00    3.00    3.00    3.00
eq.reg.x              0.50    0.50    0.50    0.50
b1.se                 0.12    0.12    0.12    0.12
b1.t                  4.24    4.24    4.24    4.24
x.SS                110.00  110.00  110.00  110.00
ResSS                13.76   13.78   13.76   13.74
RegSS                27.51   27.50   27.47   27.49
xy.cor                0.82    0.82    0.82    0.82
xy.r2                 0.67    0.67    0.67    0.67
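
As a cross-check, R ships a copy of the same data as anscombe in the datasets package (columns x1..x4 and y1..y4). The following is a minimal sketch, not part of the original analysis: it fits the four simple linear regressions directly and collects the coefficients and R^2 values. The names quartet, fits, coef.table, and r2 are illustrative choices, and the built-in copy is assigned to quartet so it does not overwrite the anscombe object created above.

```r
# Built-in copy of Anscombe's quartet: columns x1..x4 and y1..y4
quartet <- datasets::anscombe

# fit y_i ~ x_i for each of the four data sets
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = quartet)
})

# collect intercepts and slopes into a 2 x 4 table
coef.table <- sapply(fits, coef)
dimnames(coef.table) <- list(c("intercept", "slope"), as.character(1:4))

# R^2 for each fit
r2 <- sapply(fits, function(fit) summary(fit)$r.squared)

print(round(coef.table, 2))
print(round(r2, 2))
```

Rounded to two decimals, the four columns should agree with each other, matching the eq.reg and xy.r2 rows of the table.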

However...

These datasets are quite distinct!


library(ggplot2)
p <- ggplot(anscombe.long, aes(x = x, y = y))
p <- p + geom_point()
p <- p + stat_smooth(method = lm, se = FALSE)
p <- p + facet_wrap(~ g)
p <- p + labs(title = "Anscombe's quartet")
print(p)

[Figure: "Anscombe's quartet", four panels (1 to 4) of y versus x, each with its fitted least-squares line.]
Minard
One of the best
The narrative graphic of space and time par excellence is perhaps the
following plot by Charles Joseph Minard (1781–1870), the French engi-
neer, which shows the terrible fate of Napoleon’s army in Russia. This
combination of data map and time-series, drawn in 1869, portrays a se-
quence of devastating losses suffered in Napoleon’s Russian campaign of
1812.

Minard’s graphic was made as an anti-war poster.


http://www.danvk.org/wp/2009-12-04/a-new-view-on-minards-napoleon/

1. Just about everything interesting is a multivariate problem that
requires the expression of three or more dimensions of information;
even something as simple as giving travel directions to someone to
follow over time has four dimensions. We are plagued with highly
dimensional data and low resolution display surfaces, a problem which
has existed since the first maps were scratched on rocks.

2. We measure progress by improvements in resolution, i.e., an
increasing rate of information transfer, the density of the data on the
page.

1. Enforce wise visual comparisons

2. Show causality

3. The world we seek to understand is multivariate, as our displays
should be

4. Completely integrate words, numbers, and images

5. Most of what happens in design depends upon the quality, relevance,
and integrity of the content

6. Information for comparison should be put side by side

7. Use small multiples

8. Don’t dequantify

9. Meta-principle: thinking and designing are as one

The principles should not be applied rigidly or in a peevish spirit; they
are not logically or mathematically certain; and it is better to violate any
principle than to place graceless or inelegant marks on paper. Most
principles of design should be greeted with some skepticism... (VDQI p. 191)
Force answers to the question “Compared with What?”
Graphics must not quote data out of context.

Show more, hide less.

Means in the context of their distributions.


[Figure: three panels plotting a response variable against condition (1, 2, 3), ordered from less information to more information.
A. Bar plots display only two numbers (here the mean and s.e.m.) for each distribution.
B. Additional summary numbers (min, max, and quartiles) provide greater distributional information.
C. Violin plots display the shape of each distribution and may be overlaid with descriptive or inferential statistics.]

EA Allen, EB Erhardt, and VD Calhoun. Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron, 74:603–608, 2012.
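
To make the contrast between a bar plot and a distributional display concrete, here is a minimal sketch in base R. The data are simulated and purely hypothetical, and the names response, condition, sem, bar.summary, and box.summary are illustrative. It computes the two numbers a bar plot would show (mean and s.e.m.) alongside Tukey's five-number summary, the kind of summary a box plot is built on.

```r
set.seed(590)  # arbitrary seed so the simulated values are reproducible

# hypothetical data: two conditions with different shapes
response  <- c(rnorm(50, mean = 5, sd = 1), rexp(50, rate = 1 / 5))
condition <- rep(c("symmetric", "skewed"), each = 50)

sem <- function(v) sd(v) / sqrt(length(v))  # standard error of the mean

# what a bar plot shows: two numbers per condition
bar.summary <- cbind(mean = tapply(response, condition, mean),
                     sem  = tapply(response, condition, sem))

# what a box plot is built on: Tukey's five-number summary per condition
box.summary <- t(sapply(split(response, condition), fivenum))
colnames(box.summary) <- c("min", "lower.hinge", "median", "upper.hinge", "max")

print(round(bar.summary, 2))
print(round(box.summary, 2))
```

Comparing the two printed tables shows how much of each distribution (its range, centre, and asymmetry) the mean and s.e.m. alone leave out.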

We are looking at information to understand mechanisms.

Policy reasoning is about examining causality.



Napoleon was defeated by the winter, not the opposing army, as shown
by the temperature scale on the bottom of Minard’s graph.

Next: In September 1854, central London suffered an outbreak of
cholera. To stop that outbreak, Dr. John Snow made a map. By
seeing, visually, where the cholera deaths were clustered, Snow showed
that the water from a pump on Broad Street was to blame. His work
addressed an ongoing medical debate: in what is widely regarded as one
of the most important early examples of epidemiology, he clearly linked
cholera's spread to water instead of air.

Red spots indicate water pumps. Lines indicate the death count at each location.

3. The world we seek to understand is multivariate


as our displays should be
The Minard graph has six dimensions:

1. size of the army

2. x-dimensional route of the march

3. y-dimensional route of the march

4. direction of the march

5. temperatures

6. dates

Don’t let the accidents of the modes of production break up the text,
images, and data.

[Figure: displays in two columns. Column A: commonly seen displays comparing data between groups or conditions. Column B: displays helping the viewer make correct inferences; annotation and examples clarify data properties and models. Row a: average IC potential (µV) versus time (ms) for Error and Correct trials, with 95% CIs in column B. Row b: ∆β weight maps (Novel – Standard) and, in column B, a scatterplot of Novel β versus Standard β (%∆BOLD/stim), n = 28 subjects.]

EA Allen, EB Erhardt, and VD Calhoun. Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron, 74:603–608, 2012.

[Figure: score difference (Follow-up − Baseline) versus days (0 to 1200) for BDI-II, BHS, CORE-OM, and CORE-OM-R; a score decrease indicates improvement. A location key distinguishes individual observations for Team A and Team B, and each panel marks the mean with a one-sided CI.]

CR Koons, B O’Rourke, B Carter, EB Erhardt. Negotiating for improved reimbursement for Dialectical Behavior Therapy: A successful project. Cognitive and Behavioral Practice. 2013.

5. Most of what happens in design


depends upon the quality, relevance, and integrity of the content
To improve a presentation, get better content.

If your numbers are boring you have the wrong numbers.

Design won’t help; it is too late.

6. Information for comparison


should be put side by side
Within the eye span, not stacked in time on subsequent pages.
Galileo published a book in 1613 which reported the discovery of
sunspots and the rings of Saturn for the first time. He wrote in
Italian, not Latin, because he wanted to reach a wider audience than the
scientific elite.

Galileo Galilei, History and Demonstrations Concerning Sunspots and Their Phenomena (Rome, 1613), translated by Stillman Drake, Discoveries and Opinions of Galileo (Garden City, New York, 1957), pp. 115-116.

As more observations were collected daily, small multiple diagrams
recorded the data indexed on time (a design simultaneously enhancing
dimensionality and information density), with the labeled sunspots
parading along alphabetically. This profoundly multivariate analysis,
showing sunspot location in two-space, time, labels, and shifting relative
orientation of the sun in our sky, reflects data complexities that arise
because a rotating sun is observed from a rotating and orbiting earth:

[Figure: two time series with year on the horizontal axis. Top: sunspot latitude on the sun, from 90°N through the equator to 90°S. Bottom: percent of area of sun covered by sunspots, with ticks at 0.1% and 1.0%.]

At top, a Maunder diagram from 1880 to 1980, with the sine of the
latitude marking sunspot placement. Color coding (the lighter, the
larger) reflects the logarithm of the area covered by sunspots within
each areal bin of data. The lower time-series, by summing over all
latitudes, shows the total area of the sun's surface covered by sunspots
at any given time during the hundred-year sequence. Diagrams produced by
David H. Hathaway, George C. Marshall Space Flight Center, National
Aeronautics and Space Administration.

7. Use small multiples


Trellis/Lattice/Facets
They are high resolution and easy on the viewer, because once the
viewer figures out one frame, they can figure out all the rest based upon
what they have learned.

They have an inherent credibility with the viewer because they show
a lot of data – “I know what I’m talking about and I’m showing all my
data to you.”

Keep the underlying design of small multiples simple and clear.



[Figure: "Anscombe's quartet" shown again as an example of small multiples, four panels (1 to 4) of y versus x.]

Numbers have meaning.

Use numbers or a graph that represents them.

Don’t reduce quantities to on/off, yes/no, here/not.


The principles of information design are the principles of reasoning
about evidence. It is visual thinking. Good design is a lot like clear
thinking, made visible.

The converse is also true. Bad design is stupidity made visible. If a
chart has three phony dimensions to compare four numbers it shows the
person doesn’t know what they are talking about.

Start by asking, what is the intellectual task that this display is
supposed to help with?

Examples of “Bad”
are easy to find

Beautiful, informative
plots in R
Introduction to the
ggplot package.
Plotting with ggplot2
Beautiful plots made simple
# only needed once after installing or upgrading R
install.packages("ggplot2")
# each time you start R
# load ggplot2 functions and datasets
library(ggplot2)
# ggplot2 includes a dataset "mpg"
# ? gives help on a function or dataset
?mpg
# head() lists the first several rows of a data.frame
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

# str() gives the structure of the object
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

# summary() gives frequency tables for categorical variables
# and mean and five-number summaries for continuous variables
summary(mpg)
## manufacturer model displ
## dodge :37 caravan 2wd : 11 Min. :1.600
## toyota :34 ram 1500 pickup 4wd: 10 1st Qu.:2.400
## volkswagen:27 civic : 9 Median :3.300
## ford :25 dakota pickup 4wd : 9 Mean :3.472
## chevrolet :19 jetta : 9 3rd Qu.:4.600
## audi :18 mustang : 9 Max. :7.000
## (Other) :74 (Other) :177
## year cyl trans drv
## Min. :1999 Min. :4.000 auto(l4) :83 4:103
## 1st Qu.:1999 1st Qu.:4.000 manual(m5):58 f:106
## Median :2004 Median :6.000 auto(l5) :39 r: 25
## Mean :2004 Mean :5.889 manual(m6):19
## 3rd Qu.:2008 3rd Qu.:8.000 auto(s6) :16
## Max. :2008 Max. :8.000 auto(l6) : 6
## (Other) :13
## cty hwy fl class
## Min. : 9.00 Min. :12.00 c: 1 2seater : 5
## 1st Qu.:14.00 1st Qu.:18.00 d: 5 compact :47
## Median :17.00 Median :24.00 e: 8 midsize :41
## Mean :16.86 Mean :23.44 p: 52 minivan :11
## 3rd Qu.:19.00 3rd Qu.:27.00 r:168 pickup :33
## Max. :35.00 Max. :44.00 subcompact:35
## suv :62

ggplot()
# specify the dataset and variables
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point() # add a plot layer with points
print(p)

[Figure: scatterplot of hwy versus displ.]

Additional variables
Aesthetics and faceting

Geom: is the “type” of plot

Aesthetics: shape, colour, size, alpha

Faceting: “small multiples” displaying different subsets

Help is available. Try searching for examples, too.


• docs.ggplot2.org/current/
• docs.ggplot2.org/current/geom_point.html

Aesthetics
The legend is chosen and displayed automatically
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class))
print(p)

[Figure: scatterplot of hwy versus displ with points coloured by class; legend entries: 2seater, compact, midsize, minivan, pickup, subcompact, suv.]

1. Assign variables to aesthetics colour, size, and shape.

2. What’s the difference between discrete or continuous variables?

3. What happens when you combine multiple aesthetics?

Aesthetics
Behavior

Aesthetic  Discrete                  Continuous
colour     Rainbow of colors         Gradient from red to blue
size       Discrete size steps       Linear mapping between radius and value
shape      Different shape for each  Shouldn’t work
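
The discrete versus continuous distinction in the table can be tried directly. This is a minimal sketch assuming ggplot2 is installed; p.cont and p.disc are illustrative names. The same variable, cyl, is mapped to colour once as stored (numeric, so continuous) and once wrapped in factor() (so discrete). Exact default colours depend on the ggplot2 version, so the comments make no claim about them.

```r
library(ggplot2)

# cyl is stored as a number, so ggplot2 treats it as continuous:
# the colour legend is a gradient colour bar
p.cont <- ggplot(mpg, aes(x = displ, y = hwy))
p.cont <- p.cont + geom_point(aes(colour = cyl))

# wrapping cyl in factor() makes it discrete:
# each level gets its own colour and its own legend key
p.disc <- ggplot(mpg, aes(x = displ, y = hwy))
p.disc <- p.disc + geom_point(aes(colour = factor(cyl)))

print(p.cont)
print(p.disc)
```

Converting with factor() is also the usual way to make a numeric code usable for the shape aesthetic, which accepts only discrete variables.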

p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class, size = cyl, shape = drv))
print(p)

[Figure: scatterplot of hwy versus displ with colour mapped to class, size to cyl, and shape to drv; three legends (drv, cyl, class).]

p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class, size = cyl, shape = drv), alpha = 1/4) # alpha is opacity
print(p)

[Figure: the same plot of hwy versus displ drawn with alpha = 1/4, so overlapping points appear darker; legends for drv, cyl, and class.]

Faceting

• Small multiples displaying different subsets of the data.

• Useful for exploring conditional relationships. Useful for large data.

Faceting in many ways


facet_grid(rows ~ cols): 2D grid, “.” for no split
facet_wrap(~ var): 1D ribbon wrapped into 2D
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p1 <- p + facet_grid(. ~ cyl)
p2 <- p + facet_grid(drv ~ .)
p3 <- p + facet_grid(drv ~ cyl)
p4 <- p + facet_wrap(~ class)
print(p1) # print each to see

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p1 <- p + facet_grid(. ~ cyl)
print(p1)

[Figure: hwy versus displ in four side-by-side panels, one per value of cyl (4, 5, 6, 8).]

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p2 <- p + facet_grid(drv ~ .)
print(p2)

[Figure: hwy versus displ in three stacked panels, one per value of drv (4, f, r).]

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p3 <- p + facet_grid(drv ~ cyl)
print(p3)

[Figure: hwy versus displ in a grid of panels, rows by drv (4, f, r) and columns by cyl (4, 5, 6, 8).]

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p4 <- p + facet_wrap(~ class)
print(p4)

[Figure: hwy versus displ in seven wrapped panels, one per class (2seater, compact, midsize, minivan, pickup, subcompact, suv).]

Improving plots
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point()
print(p)

[Figure: scatterplot of hwy versus cty; many points fall on the same integer coordinates.]

jitter
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = 'jitter')
print(p)

[Figure: scatterplot of hwy versus cty with jittered points.]
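
Jittering helps here because cty and hwy are recorded as whole numbers, so many cars share exactly the same coordinates. The sketch below assumes ggplot2 is installed; n.overplotted is an illustrative name, and the width, height, and seed values are arbitrary choices rather than values from the notes. It first counts how many rows repeat an earlier (cty, hwy) pair, then controls the amount of jitter explicitly instead of relying on the default.

```r
library(ggplot2)

# many cars share the same integer (cty, hwy) pair, so their points coincide
n.overplotted <- sum(duplicated(mpg[, c("cty", "hwy")]))
print(n.overplotted)

# jitter adds a small random offset to each point; width and height set the
# largest offset in each direction (in data units), and seed makes the
# random offsets reproducible from one run to the next
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 590))
print(p)
```

Keeping the offsets well below one unit separates coincident points without moving any point visibly closer to a neighbouring integer value.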

p <- ggplot(mpg, aes(x = class, y = hwy))
p <- p + geom_point()
print(p)

[Figure: hwy versus class, with classes in alphabetical order: 2seater, compact, midsize, minivan, pickup, subcompact, suv.]

reorder
reordering the class variable by the mean hwy
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point()
print(p)

[Figure: hwy versus reorder(class, hwy), with classes ordered by mean hwy: pickup, suv, minivan, 2seater, midsize, subcompact, compact.]
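
What reorder() does is easiest to see on a tiny example. The sketch below uses base R only; the data frame toy and its values are hypothetical, chosen so that ordering by mean and ordering by median disagree. reorder() returns a factor whose levels are sorted by a summary of a second variable (the mean by default, or another function given through FUN), and ggplot2 lays out a discrete axis in level order.

```r
# hypothetical toy data: group "a" has the higher mean but the lower median
toy <- data.frame(
    grp = factor(c("a", "a", "a", "b", "b", "b"))
  , val = c(1, 1, 10, 3, 3, 3)
  )

levels(toy$grp)                                       # default: alphabetical, "a" "b"

by.mean   <- reorder(toy$grp, toy$val)                # FUN = mean is the default
by.median <- reorder(toy$grp, toy$val, FUN = median)  # order by median instead

levels(by.mean)    # "b" "a"  (means: a = 4, b = 3)
levels(by.median)  # "a" "b"  (medians: a = 1, b = 3)
```

Because one large value pulls the mean of group "a" above that of group "b", the two orderings differ; with skewed data the median ordering is often the closer match to what a boxplot displays.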

reorder

and jitter
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point(position = 'jitter')
print(p)

[Figure: jittered hwy versus reorder(class, hwy); class order: pickup, suv, minivan, 2seater, midsize, subcompact, compact.]

reorder

and jitter (a little less)


p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_jitter(position = position_jitter(width = .1))
print(p)

[Figure: hwy versus reorder(class, hwy) with a narrower jitter (width = .1); class order: pickup, suv, minivan, 2seater, midsize, subcompact, compact.]

reorder
and boxplot
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_boxplot()
print(p)

[Figure: boxplots of hwy by reorder(class, hwy); class order: pickup, suv, minivan, 2seater, midsize, subcompact, compact.]

reorder

and jitter and boxplot


p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_jitter(position = position_jitter(width = .1))
p <- p + geom_boxplot()
print(p)


[Figure: jittered points of hwy by reorder(class, hwy) with boxplots drawn on top; class order: pickup, suv, minivan, 2seater, midsize, subcompact, compact.]

reorder by median

and jitter and boxplot alpha


p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
p <- p + geom_jitter(position = position_jitter(width = .1))
p <- p + geom_boxplot(alpha = 0.5)
print(p)

[Figure: jittered points of hwy with semi-transparent boxplots (alpha = 0.5) drawn on top, classes ordered by median hwy: pickup, suv, minivan, 2seater, subcompact, compact, midsize.]

reorder by median
and boxplot and jitter (switched order)
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = .1))
print(p)

[Figure: semi-transparent boxplots of hwy with jittered points drawn on top, classes ordered by median hwy: pickup, suv, minivan, 2seater, subcompact, compact, midsize.]
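
ggplot2 draws layers in the order they are added, which is why switching the two geoms changes what ends up on top. One further detail, offered as a suggestion rather than something from the notes: geom_boxplot() draws its own outlier points, so when the raw data are also plotted those observations appear twice. A minimal sketch, assuming ggplot2 is installed, that hides the boxplot's outlier marks with outlier.shape = NA:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
# boxes first, without their own outlier points
# (the jitter layer below already shows every observation)
p <- p + geom_boxplot(alpha = 0.5, outlier.shape = NA)
# raw observations added last, so they are drawn on top of the boxes
p <- p + geom_jitter(position = position_jitter(width = .1))
print(p)
```

Each car then appears exactly once, as a jittered point, with the box summarising its class behind it.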

Read Edward Tufte’s books.


Explore visualization online.
Strive for clear, effective visual communication.
