
R Tutorial

Kelly Black
Department of Mathematics and Computer Science
Clarkson University
Contents:
1. Input
2. Basic Data Types
3. Basic Operations and Numerical Descriptions
4. Basic Probability Distributions
5. Basic Plots
6. Intermediate Plotting
7. Indexing Into Vectors
8. Linear Least Squares Regression
9. Calculating Confidence Intervals
10. Calculating p Values
11. Calculating the Power of a Test
12. Two Way Tables
13. Data Management
14. Scripting
15. Time Data Types
16. Case Study: Working Through a HW Problem
17. Case Study II: A JAMA Paper on Cholesterol
1. Input
Contents
Assignment
Reading a CSV file
Brief Note on Fixed Width Files
Here we explore how to define a data set in an R session. Only two commands are explored.
The first is for simple assignment of data, and the second is for reading in a data file. There
are many ways to read data into an R session, but we focus on just two to keep it simple.

1.1. Assignment
The most straight forward way to store a list of numbers is through an assignment using the c
command. (c stands for combine.) The idea is that a list of numbers is stored under a given
name, and the name is used to refer to the data. A list is specified with the c command, and
assignment is specified with the <- symbols. Another term used to describe the list of
numbers is to call it a vector.
The numbers within the c command are separated by commas. As an example, we can create
a new variable, called bubba which will contain the numbers 3, 5, 7, and 9:
> bubba <- c(3,5,7,9)
>
When you enter this command you should not see any output except a new command line.
The command creates a list of numbers called bubba. To see what numbers are included in
bubba, type bubba and press the enter key:
> bubba
[1] 3 5 7 9
If you wish to work with one of the numbers you can get access to it using the variable and
then square brackets indicating which number:
> bubba[2]
[1] 5
> bubba[1]
[1] 3
> bubba[0]
numeric(0)
> bubba[3]
[1] 7
> bubba[4]
[1] 9
Notice that the first entry is referred to as the number 1 entry, and the zero entry can be used
to indicate how the computer will treat the data. You can store strings using both single and
double quotes, and you can store real numbers.
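For example (a small illustration; the variable names and values here are made up), strings and
real numbers are stored in exactly the same way:
> name <- c("alpha",'beta')
> name
[1] "alpha" "beta"
> weight <- c(63.5,72.25)
> weight
[1] 63.50 72.25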
You now have a list of numbers and are ready to explore. In the chapters that follow we will
examine the basic operations in R that will allow you to do some of the analyses required in
class.

1.2. Reading a CSV file
Unfortunately, it is rare to have just a few data points that you do not mind typing in at the
prompt. It is much more common to have a lot of data points with complicated relationships.
Here we will examine how to read a data set from a file using the read.csv function, but first
we discuss the format of a data file.
We assume that the data file is in the format called comma separated values (csv). That is,
each line contains a row of values which can be numbers or letters, and each value is
separated by a comma. We also assume that the very first row contains a list of labels. The
idea is that the labels in the top row are used to refer to the different columns of values.
First we read a very short, somewhat silly, data file. The data file is called simple.csv and has
three columns of data and six rows. The three columns are labeled trial, mass, and
velocity. We can pretend that each row comes from an observation during one of two trials
labeled A and B. A copy of the data file is shown below and is created in defiance of
Werner Heisenberg:
simple.csv
trial,mass,velocity
A,10,12
A,11,14
B,5,8
B,6,10
A,10.5,13
B,7,11
The command to read the data file is read.csv. We have to give the command at least one
argument, but we will give three different arguments to indicate how the command can be
used in different situations. The first argument is the name of the file. The second argument
indicates whether or not the first row is a set of labels. The third argument indicates that there
is a comma between each number of each line. The following command will read in the data
and assign it to a variable called heisenberg:
> heisenberg <- read.csv(file="simple.csv",head=TRUE,sep=",")
> heisenberg
trial mass velocity
1 A 10.0 12
2 A 11.0 14
3 B 5.0 8
4 B 6.0 10
5 A 10.5 13
6 B 7.0 11
> summary(heisenberg)
 trial       mass           velocity
 A:3   Min.   : 5.00   Min.   : 8.00
 B:3   1st Qu.: 6.25   1st Qu.:10.25
       Median : 8.50   Median :11.50
       Mean   : 8.25   Mean   :11.33
       3rd Qu.:10.38   3rd Qu.:12.75
       Max.   :11.00   Max.   :14.00
(Note that if you are using a Microsoft Windows system the file naming convention is different
from what we use here. If you want to use a backslash it needs to be escaped, i.e. use two
backslashes together, \\. You can also specify which folder to use by clicking on the File
option in the main menu and choosing the option to set your working directory.)
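For example, on a Windows machine something along the following lines should work (the
folder names here are purely illustrative; substitute your own path):
> heisenberg <- read.csv(file="C:\\Users\\me\\simple.csv",head=TRUE,sep=",")
An alternative is to change the working directory first with the setwd command and then give
only the file name:
> setwd("C:\\Users\\me")
> heisenberg <- read.csv(file="simple.csv",head=TRUE,sep=",")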
To get more information on the different options available you can use the help command:
> help(read.csv)
If R is not finding the file you are trying to read then it may be looking in the wrong
folder/directory. If you are using the graphical interface you can change the working
directory from the file menu. If you are not sure what files are in the current working
directory you can use the dir() command to list the files and the getwd() command to
determine the current working directory:
> dir()
[1] "fixedWidth.dat" "simple.csv" "trees91.csv" "trees91.wk1"
[5] "w1.dat"
> getwd()
[1] "/home/black/write/class/stat/stat383-13F/dat"
The variable heisenberg contains the three columns of data. Each column is assigned a
name based on the header (the first line in the file). You can now access each individual
column using a $ to separate the two names:
> heisenberg$trial
[1] A A B B A B
Levels: A B
> heisenberg$mass
[1] 10.0 11.0 5.0 6.0 10.5 7.0
> heisenberg$velocity
[1] 12 14 8 10 13 11
If you are not sure what columns are contained in the variable you can use the names
command:
> names(heisenberg)
[1] "trial" "mass" "velocity"
We will look at another example which is used throughout this tutorial: the data found in a
spreadsheet located at http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1 . A description of the
data file is located at http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt . The original data is
given in an Excel spreadsheet. It has been converted into a csv file, trees91.csv, by deleting
the top set of rows and saving it as a csv file. This is an option to save within Excel. (You
should save the file on your computer.) It is a good idea to open this file in a spreadsheet and
look at it. This will help you make sense of how R stores the data.
The data is used to indicate an estimate of biomass of ponderosa pine in a study performed by
Dale W. Johnson, J. Timothy Ball, and Roger F. Walker who are associated with the
Biological Sciences Center, Desert Research Institute, P.O. Box 60220, Reno, NV 89506 and
the Environmental and Resource Sciences College of Agriculture, University of Nevada,
Reno, NV 89512. The data consists of 54 lines, and each line represents an observation.
Each observation includes 28 different measurements of a given tree. For example, the first
number in each row is either 1, 2, 3, or 4, which
signifies a different level of exposure to carbon dioxide. The sixth number in every row is an
estimate of the biomass of the stems of a tree. Note that the very first line in the file is a list of
labels used for the different columns of data.
The data can be read into a variable called tree using the read.csv command:
> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",");
This will create a new variable called tree. If you type in tree at the prompt and hit enter,
all of the numbers stored in the variable will be printed out. Try this, and you should see that
it is difficult to make any sense out of the numbers.
There are many different ways to keep track of data in R. When you use the read.csv
command R uses a specific kind of variable called a data frame. All of the data are stored
within the data frame as separate columns. If you are not sure what kind of variable you have
then you can use the attributes command. This will list all of the things that R uses to
describe the variable:
> attributes(tree)
$names
[1] "C" "N" "CHBR" "REP" "LFBM" "STBM" "RTBM" "LFNCC"
[9] "STNCC" "RTNCC" "LFBCC" "STBCC" "RTBCC" "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC" "STKCC" "RTKCC" "LFMGCC" "STMGCC" "RTMGCC" "LFPCC" "STPCC"
[25] "RTPCC" "LFSCC" "STSCC" "RTSCC"

$class
[1] "data.frame"

$row.names
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
"15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
"30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
"45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54"
The first thing that R stores is a list of names which refer to each column of the data. For
example, the first column is called C and the second column is called N. The variable tree is of type
data.frame. Finally, the rows are numbered consecutively from 1 to 54. Each column has 54
numbers in it.
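If you want to confirm the dimensions directly, the dim, nrow, and ncol commands report the
number of rows and columns (the output shown here assumes the trees91.csv file described
above):
> dim(tree)
[1] 54 28
> nrow(tree)
[1] 54
> ncol(tree)
[1] 28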
If you know that a variable is a data frame but are not sure what labels are used to refer to the
different columns you can use the names command:
> names(tree)
[1] "C" "N" "CHBR" "REP" "LFBM" "STBM" "RTBM" "LFNCC"
[9] "STNCC" "RTNCC" "LFBCC" "STBCC" "RTBCC" "LFCACC" "STCACC"
"RTCACC"
[17] "LFKCC" "STKCC" "RTKCC" "LFMGCC" "STMGCC" "RTMGCC" "LFPCC" "STPCC"
[25] "RTPCC" "LFSCC" "STSCC" "RTSCC"
If you want to work with the data in one of the columns you give the name of the data frame,
a $ sign, and the label assigned to the column. For example, the first column in tree can be
called using tree$C:
> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4

1.3. Brief Note on Fixed Width Files
There are many ways to read data using R. We only give two examples, direct assignment
and reading csv files. However, another way deserves a brief mention. It is common to come
across data that is organized in flat files and delimited at preset locations on each line. This is
often called a fixed width file.
The command to deal with these kind of files is read.fwf. Examples of how to use this
command are not explored here, but a brief example is given. If you would like more
information on how to use this command enter the following command:
> help(read.fwf)
The read.fwf command requires at least two options. The first is the name of the file and the
second is a list of numbers that gives the length of each column in the data file. A negative
number in the list indicates that the column should be skipped. Here we give the command to
read the data file fixedWidth.dat. In this data file there are three columns. The first column is
17 characters wide, the second column is 15 characters wide, and the last column is 7
characters wide. In the example below we use the optional col.names option to specify the
names of the columns:
> a = read.fwf('fixedWidth.dat',widths=c(-17,15,7),col.names=c('temp','offices'))
> a
temp offices
1 17.0 35
2 18.0 117
3 17.5 19
4 17.5 28
2. Basic Data Types
Contents
Variable Types
Tables
We look at some of the ways that R can store and organize data. This is a basic introduction
to a small subset of the different data types recognized by R and is not comprehensive in any
sense. The main goal is to demonstrate the different kinds of information R can handle. It is
assumed that you know how to enter data or read data files which is covered in the first
chapter.

2.1. Variable Types

2.1.1. Numbers
The way to work with real numbers has already been covered in the first chapter and is
briefly discussed here. The most basic way to store a number is to make an assignment of a
single number:
> a <- 3
>
The <- tells R to take the number to the right of the symbol and store it in a variable whose
name is given on the left. You can also use the = symbol. When you make an assignment R
does not print out any information. If you want to see what value a variable has just type the
name of the variable on a line and press the enter key:
> a
[1] 3
This allows you to do all sorts of basic operations and save the numbers:
> b <- sqrt(a*a+3)
> b
[1] 3.464102
If you want to get a list of the variables that you have defined in a particular session you can
list them all using the ls command:
> ls()
[1] "a" "b"
You are not limited to just saving a single number. You can create a list (also called a
vector) using the c command:
> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> a+1
[1] 2 3 4 5 6
> mean(a)
[1] 3
> var(a)
[1] 2.5
You can get access to particular entries in the vector in the following manner:
> a <- c(1,2,3,4,5)
> a[1]
[1] 1
> a[2]
[1] 2
> a[0]
numeric(0)
> a[5]
[1] 5
> a[6]
[1] NA
Note that the zero entry is used to indicate how the data is stored. The first entry in the vector
is the first number, and if you try to get a number past the last number you get NA.
Examples of the sort of operations you can do on vectors are given in the next chapter.
To initialize a list of numbers the numeric command can be used. For example, to create a list
of 10 numbers, initialized to zero, use the following command:
> a <- numeric(10)
> a
[1] 0 0 0 0 0 0 0 0 0 0
If you wish to determine the data type used for a variable, use the typeof command:
> typeof(a)
[1] "double"
2.1.2. Strings
You are not limited to just storing numbers. You can also store strings. A string is specified
by using quotes. Both single and double quotes will work:
> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b
[1] "hello" "there"
> b[1]
[1] "hello"
The name of the type given to strings is character:
> typeof(a)
[1] "character"
> a = character(20)
> a
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
2.1.3. Factors
Another important way R can store data is as a factor. Oftentimes an experiment includes
trials for different levels of some explanatory variable. For example, when looking at the
impact of carbon dioxide on the growth rate of a tree you might try to observe how different
trees grow when exposed to different preset concentrations of carbon dioxide. The different
levels are also called factors.
Assuming you know how to read in a file, we will look at the data file given in the first
chapter. Several of the variables in the file are factors:
> summary(tree$CHBR)
 A1  A2  A3  A4  A5  A6  A7  B1  B2  B3  B4  B5  B6  B7  C1  C2  C3  C4  C5  C6
  3   1   1   3   1   3   1   1   3   3   3   3   3   3   1   3   1   3   1   1
 C7 CL6 CL7  D1  D2  D3  D4  D5  D6  D7
  1   1   1   1   1   3   1   1   1   1
Because the values in the data file corresponding to the CHBR column are not all numbers,
R automatically assumes that it is a factor. When you use summary on a factor it does not
print out the five point summary; rather it prints out the possible values and the frequency
with which they occur.
In this data set several of the columns are factors, but the researchers used numbers to
indicate the different levels. For example, the first column, labeled C, is a factor. Each tree
was grown in an environment with one of four different possible levels of carbon dioxide.
The researchers quite sensibly labeled these four environments as 1, 2, 3, and 4.
Unfortunately, R cannot determine that these are factors and must assume that they are
regular numbers.
This is a common problem and there is a way to tell R to treat the C column as a set of
factors. You specify that a variable is a factor using the factor command. In the following
example we convert tree$C into a factor:
> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
> summary(tree$C)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.519 3.000 4.000
> tree$C <- factor(tree$C)
> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
> summary(tree$C)
1 2 3 4
8 23 10 13
> levels(tree$C)
[1] "1" "2" "3" "4"
Once a vector is converted into a set of factors then R treats it in a different manner than
when it is a set of numbers. A set of factors has a discrete set of possible values, and it does
not make sense to try to find averages or other numerical descriptions. One thing that is
important is the number of times that each factor appears, called their frequencies, which is
printed using the summary command.

2.1.4. Data Frames
Another way that information is stored is in data frames. This is a way to take many vectors
of different types and store them in the same variable. The vectors can be of all different
types. For example, a data frame may contain many lists, and each list might be a list of
factors, strings, or numbers.
There are different ways to create and manipulate data frames. Most are beyond the scope of
this introduction. They are only mentioned here to offer a more complete description. Please
see the first chapter for more information on data frames.
One example of how to create a data frame is given below:
> a <- c(1,2,3,4)
> b <- c(2,4,6,8)
> levels <- factor(c("A","B","A","B"))
> bubba <- data.frame(first=a,
second=b,
f=levels)
> bubba
first second f
1 1 2 A
2 2 4 B
3 3 6 A
4 4 8 B
> summary(bubba)
first second f
Min. :1.00 Min. :2.0 A:2
1st Qu.:1.75 1st Qu.:3.5 B:2
Median :2.50 Median :5.0
Mean :2.50 Mean :5.0
3rd Qu.:3.25 3rd Qu.:6.5
Max. :4.00 Max. :8.0
> bubba$first
[1] 1 2 3 4
> bubba$second
[1] 2 4 6 8
> bubba$f
[1] A B A B
Levels: A B
2.1.5. Logical
Another important data type is the logical type. There are two predefined variables, TRUE
and FALSE:
> a = TRUE
> typeof(a)
[1] "logical"
> b = FALSE
> typeof(b)
[1] "logical"
The standard logical operators can be used:
< less than
> greater than
<= less than or equal
>= greater than or equal
== equal to
!= not equal to
| entry wise or
|| or
! not
& entry wise and
&& and
xor(a,b) exclusive or
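As a brief illustration, the comparison operators also act on each entry of a vector and return
a logical vector (the numbers here are arbitrary):
> a <- c(1,5,3,8)
> a > 3
[1] FALSE  TRUE FALSE  TRUE
> a == 5
[1] FALSE  TRUE FALSE FALSE
> (a > 2) & (a < 8)
[1] FALSE  TRUE  TRUE FALSE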
Note that there is a difference between operators that act on entries within a vector and the
whole vector:
> a = c(TRUE,FALSE)
> b = c(FALSE,FALSE)
> a|b
[1] TRUE FALSE
> a||b
[1] TRUE
> xor(a,b)
[1] TRUE FALSE
There are a large number of functions that test to determine the type of a variable. For
example the is.numeric function can determine if a variable is numeric:
> a = c(1,2,3)
> is.numeric(a)
[1] TRUE
> is.factor(a)
[1] FALSE

2.2. Tables
Another common way to store information is in a table. Here we look at how to define both
one way and two way tables. We only look at how to create and define tables; the functions
used in the analysis of proportions are examined in another chapter.

2.2.1. One Way Tables
The first example is for a one way table. One way tables are not the most interesting example,
but they are a good place to start. One way to create a table is using the table command. The
argument it takes is a vector of factors, and it calculates the frequency with which each factor
occurs. Here is an example of how to create a one way table:
> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> results
a
A B C
4 3 2
> attributes(results)
$dim
[1] 3

$dimnames
$dimnames$a
[1] "A" "B" "C"


$class
[1] "table"

> summary(results)
Number of cases in table: 9
Number of factors: 1
If you know the number of occurrences for each factor then it is possible to create the table
directly, but the process is, unfortunately, a bit more convoluted. There is an easier way to
define one-way tables (a table with one row), but it does not extend easily to two-way tables
(tables with more than one row). You must first create a matrix of numbers. A matrix is like a
vector in that it is a list of numbers, but it is different in that you can have both rows and
columns of numbers. For example, in our example above the number of occurrences of A is
4, the number of occurrences of B is 3, and the number of occurrences of C is 2. We will
create one row of numbers. The first column contains a 4, the second column contains a 3,
and the third column contains a 2:
> occur <- matrix(c(4,3,2),ncol=3,byrow=TRUE)
> occur
[,1] [,2] [,3]
[1,] 4 3 2
At this point the variable occur is a matrix with one row and three columns of numbers. To
dress it up and use it as a table we would like to give it labels for each column just like in the
previous example. Once that is done we convert the matrix to a table using the as.table
command:
> colnames(occur) <- c("A","B","C")
> occur
A B C
[1,] 4 3 2
> occur <- as.table(occur)
> occur
A B C
A 4 3 2
> attributes(occur)
$dim
[1] 1 3

$dimnames
$dimnames[[1]]
[1] "A"

$dimnames[[2]]
[1] "A" "B" "C"


$class
[1] "table"


2.2.2. Two Way Tables
If you want to add rows to your table just add another vector to the argument of the table
command. In the example below we have two questions. In the first question the responses
are labeled Never, Sometimes, or Always. In the second question the responses are
labeled Yes, No, or Maybe. The vectors a and b contain the response for
each measurement. The third item in a is how the third person responded to the first
question, and the third item in b is how the third person responded to the second question.
> a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
> b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
> results <- table(a,b)
> results
b
a Maybe No Yes
Always 2 0 0
Never 0 1 1
Sometimes 2 1 1
The table command allows us to do a very quick calculation, and we can immediately see that
two people who answered Sometimes to the first question also answered Maybe to the second
question.
Just as in the case with one-way tables it is possible to manually enter two way tables. The
procedure is exactly the same as above except that we now have more than one row. We give
a brief example below to demonstrate how to enter a two-way table that includes a breakdown
of a group of people by both their gender and whether or not they smoke. You enter all of the
data as one long list but tell R to break it up into some number of columns:
> sexsmoke<-matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
> rownames(sexsmoke)<-c("male","female")
> colnames(sexsmoke)<-c("smoke","nosmoke")
> sexsmoke <- as.table(sexsmoke)
> sexsmoke
smoke nosmoke
male 70 120
female 65 140
The matrix command creates a two by two matrix. The byrow=TRUE option indicates that
the numbers are filled in across the rows first, and the ncol=2 option indicates that there are two
columns.
3. Basic Operations and Numerical Descriptions
Contents
Basic Operations
Basic Numerical Descriptions
Operations on Vectors
We look at some of the basic operations that you can perform on lists of numbers. It is
assumed that you know how to enter data or read data files which is covered in the first
chapter, and you know about the basic data types.

3.1. Basic Operations
Once you have a vector (or a list of numbers) in memory most basic operations are available.
Most of the basic operations will act on a whole vector and can be used to quickly perform a
large number of calculations with a single command. There is one thing to note: if you
perform an operation on more than one vector it is often necessary that the vectors all contain
the same number of entries.
Here we first define a vector which we will call a and will look at how to add and subtract
constant numbers from all of the numbers in the vector. First, the vector will contain the
numbers 1, 2, 3, and 4. We then see how to add 5 to each of the numbers, subtract 10 from
each of the numbers, multiply each number by 4, and divide each number by 5.
> a <- c(1,2,3,4)
> a
[1] 1 2 3 4
> a + 5
[1] 6 7 8 9
> a - 10
[1] -9 -8 -7 -6
> a*4
[1] 4 8 12 16
> a/5
[1] 0.2 0.4 0.6 0.8
We can save the results in another vector called b:
> b <- a - 10
> b
[1] -9 -8 -7 -6
If you want to take the square root, find e raised to each number, the logarithm, etc., then the
usual commands can be used:
> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000
> exp(a)
[1] 2.718282 7.389056 20.085537 54.598150
> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> exp(log(a))
[1] 1 2 3 4
By combining operations and using parentheses you can make more complicated expressions:
> c <- (a + sqrt(a))/(exp(2)+1)
> c
[1] 0.2384058 0.4069842 0.5640743 0.7152175
Note that you can do the same operations with vector arguments. For example to add the
elements in vector a to the elements in vector b use the following command:
> a + b
[1] -8 -6 -4 -2
The operation is performed on an element by element basis. Note this is true for almost all of
the basic functions. So you can bring together all kinds of complicated expressions:
> a*b
[1] -9 -16 -21 -24
> a/b
[1] -0.1111111 -0.2500000 -0.4285714 -0.6666667
> (a+3)/(sqrt(1-b)*2-1)
[1] 0.7512364 1.0000000 1.2884234 1.6311303
You need to be careful of one thing. When you do operations on vectors they are performed
on an element by element basis. One ramification of this is that all of the vectors in an
expression must be the same length. If the lengths of the vectors differ then you may get an
error message, or worse, a warning message and unpredictable results:
> a <- c(1,2,3)
> b <- c(10,11,12,13)
> a+b
[1] 11 13 15 14
Warning message:
longer object length
is not a multiple of shorter object length in: a + b
As you work in R and create new vectors it can be easy to lose track of what variables you
have defined. To get a list of all of the variables that have been defined use the ls() command:
> ls()
[1] "a" "b" "bubba" "c"
"last.warning"
[6] "tree" "trees"
Finally, you should keep in mind that the basic operations almost always work on an element
by element basis. There are rare exceptions to this general rule. For example, if you look at
the minimum of two vectors using the min command you will get the minimum of all of the
numbers. There is a special command, called pmin, that may be the command you want in
some circumstances:
> a <- c(1,-2,3,-4)
> b <- c(-1,2,-3,4)
> min(a,b)
[1] -4
> pmin(a,b)
[1] -1 -2 -3 -4

3.2. Basic Numerical Descriptions
Given a vector of numbers there are some basic commands to make it easier to get some of
the basic numerical descriptions of a set of numbers. Here we assume that you can read in the
tree data that was discussed in a previous chapter. It is assumed that it is stored in a variable
called tree:
> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",");
> names(tree)
[1] "C" "N" "CHBR" "REP" "LFBM" "STBM" "RTBM" "LFNCC"
[9] "STNCC" "RTNCC" "LFBCC" "STBCC" "RTBCC" "LFCACC" "STCACC"
"RTCACC"
[17] "LFKCC" "STKCC" "RTKCC" "LFMGCC" "STMGCC" "RTMGCC" "LFPCC" "STPCC"
[25] "RTPCC" "LFSCC" "STSCC" "RTSCC"
Each column in the data frame can be accessed as a vector. For example the numbers
associated with the leaf biomass (LFBM) can be found using tree$LFBM:
> tree$LFBM
 [1] 0.430 0.400 0.450 0.820 0.520 1.320 0.900 1.180 0.480 0.210 0.270 0.310
[13] 0.650 0.180 0.520 0.300 0.580 0.480 0.580 0.580 0.410 0.480 1.760 1.210
[25] 1.180 0.830 1.220 0.770 1.020 0.130 0.680 0.610 0.700 0.820 0.760 0.770
[37] 1.690 1.480 0.740 1.240 1.120 0.750 0.390 0.870 0.410 0.560 0.550 0.670
[49] 1.260 0.965 0.840 0.970 1.070 1.220
The following commands can be used to get the mean, median, quantiles, minimum,
maximum, variance, and standard deviation of a set of numbers:
> mean(tree$LFBM)
[1] 0.7649074
> median(tree$LFBM)
[1] 0.72
> quantile(tree$LFBM)
0% 25% 50% 75% 100%
0.1300 0.4800 0.7200 1.0075 1.7600
> min(tree$LFBM)
[1] 0.13
> max(tree$LFBM)
[1] 1.76
> var(tree$LFBM)
[1] 0.1429382
> sd(tree$LFBM)
[1] 0.3780717
Finally, the summary command will print out the min, max, mean, median, and quantiles:
> summary(tree$LFBM)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1300 0.4800 0.7200 0.7649 1.0080 1.7600
The summary command is especially nice because if you give it a data frame it will print out
the summary for every vector in the data frame:
> summary(tree)
       C               N              CHBR         REP             LFBM
 Min.   :1.000   Min.   :1.000   A1     : 3   Min.   : 1.00   Min.   :0.1300
 1st Qu.:2.000   1st Qu.:1.000   A4     : 3   1st Qu.: 9.00   1st Qu.:0.4800
 Median :2.000   Median :2.000   A6     : 3   Median :14.00   Median :0.7200
 Mean   :2.519   Mean   :1.926   B2     : 3   Mean   :13.05   Mean   :0.7649
 3rd Qu.:3.000   3rd Qu.:3.000   B3     : 3   3rd Qu.:20.00   3rd Qu.:1.0075
 Max.   :4.000   Max.   :3.000   B4     : 3   Max.   :20.00   Max.   :1.7600
                                 (Other):36   NA's   :11.00
STBM RTBM LFNCC STNCC
Min. :0.0300 Min. :0.1200 Min. :0.880 Min. :0.3700
1st Qu.:0.1900 1st Qu.:0.2825 1st Qu.:1.312 1st Qu.:0.6400
Median :0.2450 Median :0.4450 Median :1.550 Median :0.7850
Mean :0.2883 Mean :0.4662 Mean :1.560 Mean :0.7872
3rd Qu.:0.3800 3rd Qu.:0.5500 3rd Qu.:1.788 3rd Qu.:0.9350
Max. :0.7200 Max. :1.5100 Max. :2.760 Max. :1.2900

RTNCC LFBCC STBCC RTBCC
Min. :0.4700 Min. :25.00 Min. :14.00 Min. :15.00
1st Qu.:0.6000 1st Qu.:34.00 1st Qu.:17.00 1st Qu.:19.00
Median :0.7500 Median :37.00 Median :18.00 Median :20.00
Mean :0.7394 Mean :36.96 Mean :18.80 Mean :21.43
3rd Qu.:0.8100 3rd Qu.:41.00 3rd Qu.:20.00 3rd Qu.:23.00
Max. :1.5500 Max. :48.00 Max. :27.00 Max. :41.00

LFCACC STCACC RTCACC LFKCC
Min. :0.2100 Min. :0.1300 Min. :0.1100 Min. :0.6500
1st Qu.:0.2600 1st Qu.:0.1600 1st Qu.:0.1600 1st Qu.:0.8100
Median :0.2900 Median :0.1700 Median :0.1650 Median :0.9000
Mean :0.2869 Mean :0.1774 Mean :0.1654 Mean :0.9053
3rd Qu.:0.3100 3rd Qu.:0.1875 3rd Qu.:0.1700 3rd Qu.:0.9900
Max. :0.3600 Max. :0.2400 Max. :0.2400 Max. :1.1800
NA's :1.0000
STKCC RTKCC LFMGCC STMGCC
Min. :0.870 Min. :0.330 Min. :0.0700 Min. :0.100
1st Qu.:0.940 1st Qu.:0.400 1st Qu.:0.1000 1st Qu.:0.110
Median :1.055 Median :0.475 Median :0.1200 Median :0.130
Mean :1.105 Mean :0.473 Mean :0.1109 Mean :0.135
3rd Qu.:1.210 3rd Qu.:0.520 3rd Qu.:0.1300 3rd Qu.:0.150
Max. :1.520 Max. :0.640 Max. :0.1400 Max. :0.190

RTMGCC LFPCC STPCC RTPCC
Min. :0.04000 Min. :0.1500 Min. :0.1500 Min. :0.1000
1st Qu.:0.06000 1st Qu.:0.2000 1st Qu.:0.2200 1st Qu.:0.1300
Median :0.07000 Median :0.2400 Median :0.2800 Median :0.1450
Mean :0.06648 Mean :0.2381 Mean :0.2707 Mean :0.1465
3rd Qu.:0.07000 3rd Qu.:0.2700 3rd Qu.:0.3175 3rd Qu.:0.1600
Max. :0.09000 Max. :0.3100 Max. :0.4100 Max. :0.2100

LFSCC STSCC RTSCC
Min. :0.0900 Min. :0.1400 Min. :0.0900
1st Qu.:0.1325 1st Qu.:0.1600 1st Qu.:0.1200
Median :0.1600 Median :0.1800 Median :0.1300
Mean :0.1661 Mean :0.1817 Mean :0.1298
3rd Qu.:0.1875 3rd Qu.:0.2000 3rd Qu.:0.1475
Max. :0.2600 Max. :0.2800 Max. :0.1700


3.3. Operations on Vectors
Here we look at some commonly used commands that perform operations on lists. The
commands include the sort, min, max, and sum commands. First, the sort command can sort
the given vector in either ascending or descending order:
> a = c(2,4,6,3,1,5)
> b = sort(a)
> c = sort(a,decreasing = TRUE)
> a
[1] 2 4 6 3 1 5
> b
[1] 1 2 3 4 5 6
> c
[1] 6 5 4 3 2 1
The min and the max commands find the minimum and the maximum numbers in the vector:
> min(a)
[1] 1
> max(a)
[1] 6
Finally, the sum command adds up the numbers in the vector:
> sum(a)
[1] 21





4. Basic Probability Distributions
Contents
The Normal Distribution
The t Distribution
The Binomial Distribution
The Chi-Squared Distribution
We look at some of the basic operations associated with probability distributions. There are a
large number of probability distributions available, but we only look at a few. If you would
like to know what distributions are available you can do a search using the command
help.search("distribution").
Here we give details about the commands associated with the normal distribution and briefly
mention the commands for other distributions. The functions for the different distributions are
very similar, and the differences are noted below.
For this chapter it is assumed that you know how to enter data which is covered in the
previous chapters.
To get a full list of the distributions available in R you can use the following command:
help(Distributions)
For every distribution there are four commands. The commands for each distribution are
prepended with a letter to indicate the functionality:
d returns the height of the probability density function
p returns the cumulative distribution function
q returns the inverse cumulative distribution function (quantiles)
r returns randomly generated numbers
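For example, applying the same naming convention to the uniform distribution gives the
functions dunif, punif, qunif, and runif (a quick illustration of the pattern; the uniform
distribution is not discussed further here, and your runif output will differ since the numbers
are random):
> dunif(0.5)
[1] 1
> punif(0.25)
[1] 0.25
> qunif(0.9)
[1] 0.9
> runif(3)
[1] 0.2875775 0.7883051 0.4089769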

4.1. The Normal Distribution
There are four functions that can be used to generate the values associated with the normal
distribution. You can get a full list of them and their options using the help command:
> help(Normal)
The first function we look at is dnorm. Given a set of values it returns the height of the
probability distribution at each point. If you only give the points it assumes you want to use a
mean of zero and standard deviation of one. There are options to use different values for the
mean and standard deviation, though:
> dnorm(0)
[1] 0.3989423
> dnorm(0)*sqrt(2*pi)
[1] 1
> dnorm(0,mean=4)
[1] 0.0001338302
> dnorm(0,mean=4,sd=10)
[1] 0.03682701
> v <- c(0,1,2)
> dnorm(v)
[1] 0.39894228 0.24197072 0.05399097
> x <- seq(-20,20,by=.1)
> y <- dnorm(x)
> plot(x,y)
> y <- dnorm(x,mean=2.5,sd=0.1)
> plot(x,y)
The second function we examine is pnorm. Given a number or a list it computes the
probability that a normally distributed random number will be less than that number. This
function also goes by the rather ominous title of the Cumulative Distribution Function. It
accepts the same options as dnorm:
> pnorm(0)
[1] 0.5
> pnorm(1)
[1] 0.8413447
> pnorm(0,mean=2)
[1] 0.02275013
> pnorm(0,mean=2,sd=3)
[1] 0.2524925
> v <- c(0,1,2)
> pnorm(v)
[1] 0.5000000 0.8413447 0.9772499
> x <- seq(-20,20,by=.1)
> y <- pnorm(x)
> plot(x,y)
> y <- pnorm(x,mean=3,sd=4)
> plot(x,y)
If you wish to find the probability that a number is larger than the given number you can use
the lower.tail option:
> pnorm(0,lower.tail=FALSE)
[1] 0.5
> pnorm(1,lower.tail=FALSE)
[1] 0.1586553
> pnorm(0,mean=2,lower.tail=FALSE)
[1] 0.9772499
The next function we look at is qnorm which is the inverse of pnorm. The idea behind qnorm
is that you give it a probability, and it returns the number whose cumulative distribution
matches the probability. For example, if you have a normally distributed random variable
with mean zero and standard deviation one, then if you give the function a probability it
returns the associated Z-score:
> qnorm(0.5)
[1] 0
> qnorm(0.5,mean=1)
[1] 1
> qnorm(0.5,mean=1,sd=2)
[1] 1
> qnorm(0.5,mean=2,sd=2)
[1] 2
> qnorm(0.5,mean=2,sd=4)
[1] 2
> qnorm(0.25,mean=2,sd=2)
[1] 0.6510205
> qnorm(0.333)
[1] -0.4316442
> qnorm(0.333,sd=3)
[1] -1.294933
> qnorm(0.75,mean=5,sd=2)
[1] 6.34898
> v = c(0.1,0.3,0.75)
> qnorm(v)
[1] -1.2815516 -0.5244005 0.6744898
> x <- seq(0,1,by=.05)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)
The last function we examine is the rnorm function which can generate random numbers
whose distribution is normal. The argument that you give it is the number of random numbers
that you want, and it has optional arguments to specify the mean and standard deviation:
> rnorm(4)
[1] 1.2387271 -0.2323259 -1.2003081 -1.6718483
> rnorm(4,mean=3)
[1] 2.633080 3.617486 2.038861 2.601933
> rnorm(4,mean=3,sd=3)
[1] 4.580556 2.974903 4.756097 6.395894
> rnorm(4,mean=3,sd=3)
[1] 3.000852 3.714180 10.032021 3.295667
> y <- rnorm(200)
> hist(y)
> y <- rnorm(200,mean=-2)
> hist(y)
> y <- rnorm(200,mean=-2,sd=4)
> hist(y)
> qqnorm(y)
> qqline(y)

4.2. The t Distribution
There are four functions that can be used to generate the values associated with the t
distribution. You can get a full list of them and their options using the help command:
> help(TDist)
These commands work just like the commands for the normal distribution. One difference is
that the commands assume that the values are normalized to mean zero and standard
deviation one, so you have to use a little algebra to use these functions in practice. The other
difference is that you have to specify the number of degrees of freedom. The commands
follow the same kind of naming convention, and the names of the commands are dt, pt, qt,
and rt.
A few examples are given below to show how to use the different commands. First we have
the distribution function, dt:
> x <- seq(-20,20,by=.5)
> y <- dt(x,df=10)
> plot(x,y)
> y <- dt(x,df=50)
> plot(x,y)
Next we have the cumulative probability distribution function:
> pt(-3,df=10)
[1] 0.006671828
> pt(3,df=10)
[1] 0.9933282
> 1-pt(3,df=10)
[1] 0.006671828
> pt(3,df=20)
[1] 0.996462
> x = c(-3,-4,-2,-1)
> pt((mean(x)-2)/sd(x),df=20)
[1] 0.001165548
> pt((mean(x)-2)/sd(x),df=40)
[1] 0.000603064
Next we have the inverse cumulative probability distribution function:
> qt(0.05,df=10)
[1] -1.812461
> qt(0.95,df=10)
[1] 1.812461
> qt(0.05,df=20)
[1] -1.724718
> qt(0.95,df=20)
[1] 1.724718
> v <- c(0.005,.025,.05)
> qt(v,df=253)
[1] -2.595401 -1.969385 -1.650899
> qt(v,df=25)
[1] -2.787436 -2.059539 -1.708141
Finally random numbers can be generated according to the t distribution:
> rt(3,df=10)
[1] 0.9440930 2.1734365 0.6785262
> rt(3,df=20)
[1] 0.1043300 -1.4682198 0.0715013
> rt(3,df=20)
[1] 0.8023832 -0.4759780 -1.0546125

4.3. The Binomial Distribution
There are four functions that can be used to generate the values associated with the binomial
distribution. You can get a full list of them and their options using the help command:
> help(Binomial)
These commands work just like the commands for the normal distribution. The binomial
distribution requires two extra parameters, the number of trials and the probability of success
for a single trial. The commands follow the same kind of naming convention, and the names
of the commands are dbinom, pbinom, qbinom, and rbinom.
A few examples are given below to show how to use the different commands. First we have
the distribution function, dbinom:
> x <- seq(0,50,by=1)
> y <- dbinom(x,50,0.2)
> plot(x,y)
> y <- dbinom(x,50,0.6)
> plot(x,y)
> x <- seq(0,100,by=1)
> y <- dbinom(x,100,0.6)
> plot(x,y)
Next we have the cumulative probability distribution function:
> pbinom(24,50,0.5)
[1] 0.4438624
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,51,0.5)
[1] 0.5
> pbinom(26,51,0.5)
[1] 0.610116
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,50,0.25)
[1] 0.999962
> pbinom(25,500,0.25)
[1] 4.955658e-33
Next we have the inverse cumulative probability distribution function:
> qbinom(0.5,51,1/2)
[1] 25
> qbinom(0.25,51,1/2)
[1] 23
> pbinom(23,51,1/2)
[1] 0.2879247
> pbinom(22,51,1/2)
[1] 0.200531
Finally random numbers can be generated according to the binomial distribution:
> rbinom(5,100,.2)
[1] 30 23 21 19 18
> rbinom(5,100,.7)
[1] 66 66 58 68 63

4.4. The Chi-Squared Distribution
There are four functions that can be used to generate the values associated with the Chi-
Squared distribution. You can get a full list of them and their options using the help
command:
> help(Chisquare)
These commands work just like the commands for the normal distribution. The first
difference is that it is assumed that you have normalized the value so no mean can be
specified. The other difference is that you have to specify the number of degrees of freedom.
The commands follow the same kind of naming convention, and the names of the commands
are dchisq, pchisq, qchisq, and rchisq.
A few examples are given below to show how to use the different commands. First we have
the distribution function, dchisq:
> x <- seq(-20,20,by=.5)
> y <- dchisq(x,df=10)
> plot(x,y)
> y <- dchisq(x,df=12)
> plot(x,y)
Next we have the cumulative probability distribution function:
> pchisq(2,df=10)
[1] 0.003659847
> pchisq(3,df=10)
[1] 0.01857594
> 1-pchisq(3,df=10)
[1] 0.981424
> pchisq(3,df=20)
[1] 4.097501e-06
> x = c(2,4,5,6)
> pchisq(x,df=20)
[1] 1.114255e-07 4.649808e-05 2.773521e-04 1.102488e-03
Next we have the inverse cumulative probability distribution function:
> qchisq(0.05,df=10)
[1] 3.940299
> qchisq(0.95,df=10)
[1] 18.30704
> qchisq(0.05,df=20)
[1] 10.85081
> qchisq(0.95,df=20)
[1] 31.41043
> v <- c(0.005,.025,.05)
> qchisq(v,df=253)
[1] 198.8161 210.8355 217.1713
> qchisq(v,df=25)
[1] 10.51965 13.11972 14.61141
Finally random numbers can be generated according to the Chi-Squared distribution:
> rchisq(3,df=10)
[1] 16.80075 20.28412 12.39099
> rchisq(3,df=20)
[1] 17.838878 8.591936 17.486372
> rchisq(3,df=20)
[1] 11.19279 23.86907 24.81251
5. Basic Plots
Contents
Strip Charts
Histograms
Boxplots
Scatter Plots
Normal QQ Plots
We look at some of the ways R can display information graphically. This is a basic
introduction to some of the basic plotting commands. It is assumed that you know how to
enter data or read data files which is covered in the first chapter, and it is assumed that you
are familiar with the different data types.
In each of the topics that follow it is assumed that two different data sets, w1.dat and
trees91.csv have been read and defined using the same variables as in the first chapter. Both
of these data sets come from the study discussed on the web site given in the first chapter. We
assume that they are read using read.csv into variables w1 and tree:
> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> names(w1)
[1] "vals"
> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> names(tree)
[1] "C" "N" "CHBR" "REP" "LFBM" "STBM" "RTBM" "LFNCC"
[9] "STNCC" "RTNCC" "LFBCC" "STBCC" "RTBCC" "LFCACC" "STCACC"
"RTCACC"
[17] "LFKCC" "STKCC" "RTKCC" "LFMGCC" "STMGCC" "RTMGCC" "LFPCC" "STPCC"
[25] "RTPCC" "LFSCC" "STSCC" "RTSCC"

5.1. Strip Charts
A strip chart is the most basic type of plot available. It plots the data in order along a line
with each data point represented as a box. Here we provide examples using the w1 data frame
mentioned at the top of this page, and the one column of the data is w1$vals.
To create a strip chart of this data use the stripchart command:
> help(stripchart)
> stripchart(w1$vals)

Strip Chart
This is the most basic possible strip chart. The stripchart() command takes many of the
standard plot() options for labeling and annotations.
As you can see this is about as bare bones as you can get. There is no title nor axes labels. It
only shows how the data looks if you were to put it all along one line and mark out a box at
each point. If you would prefer to see which points are repeated you can specify that repeated
points be stacked:
> stripchart(w1$vals,method="stack")
A variation on this is to have the boxes moved up and down so that there is more separation
between them:
> stripchart(w1$vals,method="jitter")
If you do not want the boxes plotted in the horizontal direction you can plot them in the
vertical direction:
> stripchart(w1$vals,vertical=TRUE)
> stripchart(w1$vals,vertical=TRUE,method="jitter")
Since you should always annotate your plots there are many different ways to add titles and
labels. One way is within the stripchart command itself:
> stripchart(w1$vals,method="stack",
main='Leaf BioMass in High CO2 Environment',
xlab='BioMass of Leaves')
If you have a plot already and want to add a title, you can use the title command:
> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')
Note that this simply adds the title and labels and will write over the top of any titles or labels
you already have.

5.2. Histograms
A histogram is a very common plot. It plots the frequencies with which data appears within certain
ranges. Here we provide examples using the w1 data frame mentioned at the top of this page,
and the one column of data is w1$vals.
To plot a histogram of the data use the hist command:
> hist(w1$vals)
> hist(w1$vals,main="Distribution of w1",xlab="w1")

Histogram Options
Many of the basic plot commands accept the same options. The help(hist) command will give
you options specifically for the hist command. Also note that help(plot) lists additional
options that are shared by many of the plotting commands. Experiment with different options
to see what you can do.
As you can see R will automatically calculate the intervals to use. There are many options to
determine how to break up the intervals. Here we look at just one way, varying the domain
size and number of breaks. If you would like to know more about the other options check out
the help page:
> help(hist)
You can specify the number of breaks to use using the breaks option. Here we look at the
histogram for various numbers of breaks:
> hist(w1$vals,breaks=2)
> hist(w1$vals,breaks=4)
> hist(w1$vals,breaks=6)
> hist(w1$vals,breaks=8)
> hist(w1$vals,breaks=12)
>
You can also vary the size of the domain using the xlim option. This option takes a vector
with two entries in it, the left value and the right value:
> hist(w1$vals,breaks=12,xlim=c(0,10))
> hist(w1$vals,breaks=12,xlim=c(-1,2))
> hist(w1$vals,breaks=12,xlim=c(0,2))
> hist(w1$vals,breaks=12,xlim=c(1,1.3))
> hist(w1$vals,breaks=12,xlim=c(0.9,1.3))
>
The options for adding titles and labels are exactly the same as for strip charts. You should
always annotate your plots and there are many different ways to add titles and labels. One
way is within the hist command itself:
> hist(w1$vals,
main='Leaf BioMass in High CO2 Environment',
xlab='BioMass of Leaves')
If you have a plot already and want to change or add a title, you can use the title command:
> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')
Note that this simply adds the title and labels and will write over the top of any titles or labels
you already have.
It is not uncommon to add other kinds of plots to a histogram. For example, one of the
options to the stripchart command is to add it to a plot that has already been drawn. For
example, you might want to have a histogram with the strip chart drawn across the top. The
addition of the strip chart might give you a better idea of the density of the data:
> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> stripchart(w1$vals,add=TRUE,at=15.5)

5.3. Boxplots
A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a
data set. Here we provide examples using two different data sets. The first is the w1 data
frame mentioned at the top of this page, and the one column of data is w1$vals. The second is
the tree data frame from the trees91.csv data file which is also mentioned at the top of the
page.
We first use the w1 data set and look at the boxplot of this data set:
> boxplot(w1$vals)

Again, this is a very plain graph, and the title and labels can be specified in exactly the same
way as in the stripchart and hist commands:
> boxplot(w1$vals,
main='Leaf BioMass in High CO2 Environment',
ylab='BioMass of Leaves')
Note that the default orientation is to plot the boxplot vertically. Because of this we used the
ylab option to specify the axis label. There are a large number of options for this command.
To see more of the options see the help page:
> help(boxplot)
As an example you can specify that the boxplot be plotted horizontally by specifying the
horizontal option:
> boxplot(w1$vals,
main='Leaf BioMass in High CO2 Environment',
xlab='BioMass of Leaves',
horizontal=TRUE)
The option to plot the box plot horizontally can be put to good use to display a box plot on
the same image as a histogram. You need to specify the add option, specify where to put the
box plot using the at option, and turn off the addition of axes using the axes option:
> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=15.5,add=TRUE,axes=FALSE)
If you are feeling really crazy you can take a histogram and add a box plot and a strip chart:
> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=16,add=TRUE,axes=FALSE)
> stripchart(w1$vals,add=TRUE,at=15)
Some people shell out good money to have this much fun.
For the second part on boxplots we will look at the second data frame, tree, which comes
from the trees91.csv file. To reiterate the discussion at the top of this page and the discussion
in the data types chapter, we need to specify which columns are factors:
> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> tree$C <- factor(tree$C)
> tree$N <- factor(tree$N)
We can look at the boxplot of just the data for the stem biomass:
> boxplot(tree$STBM,
main='Stem BioMass in Different CO2 Environments',
ylab='BioMass of Stems')
That plot does not tell the whole story. It is for all of the trees, but the trees were grown in
different kinds of environments. The boxplot command can be used to plot a separate box
plot for each level. In this case the data is held in tree$STBM, and the different levels are
stored as factors in tree$C. The command to create different boxplots is the following:
> boxplot(tree$STBM~tree$C)
Note that for the level called 2 there are four outliers which are plotted as little circles.
There are many options to annotate your plot including different labels for each level. Please
use the help(boxplot) command for more information.

5.4. Scatter Plots
A scatter plot provides a graphical view of the relationship between two sets of numbers.
Here we provide examples using the tree data frame from the trees91.csv data file which is
mentioned at the top of the page. In particular we look at the relationship between the stem
biomass (tree$STBM) and the leaf biomass (tree$LFBM).
The command to plot each pair of points as an x-coordinate and a y-coordinate is plot:
> plot(tree$STBM,tree$LFBM)
It appears that there is a strong positive association between the biomass in the stems of a tree
and the leaves of the tree. It appears to be a linear relationship. In fact, the correlation between
these two sets of observations is quite high:
> cor(tree$STBM,tree$LFBM)
[1] 0.911595
Getting back to the plot, you should always annotate your graphs. The title and labels can be
specified in exactly the same way as with the other plotting commands:
> plot(tree$STBM,tree$LFBM,
main="Relationship Between Stem and Leaf Biomass",
xlab="Stem Biomass",
ylab="Leaf Biomass")

5.5. Normal QQ Plots
The final type of plot that we look at is the normal quantile plot. This plot is used to
determine if your data is close to being normally distributed. You cannot be sure that the data
is normally distributed, but you can rule it out if it is clearly not normally distributed. Here we provide
examples using the w1 data frame mentioned at the top of this page, and the one column of
data is w1$vals.
The command to generate a normal quantile plot is qqnorm. You can give it one argument,
the univariate data set of interest:
> qqnorm(w1$vals)
You can annotate the plot in exactly the same way as all of the other plotting commands
given here:
> qqnorm(w1$vals,
main="Normal Q-Q Plot of the Leaf Biomass",
xlab="Theoretical Quantiles of the Leaf Biomass",
ylab="Sample Quantiles of the Leaf Biomass")
After you create the normal quantile plot you can also add the theoretical line that the data
should fall on if they were normally distributed:
> qqline(w1$vals)
In this example you should see that the data is not quite normally distributed. There are a few
outliers, and it does not match up at the tails of the distribution.


6. Intermediate Plotting
Contents
Continuous Data
Discrete Data
Miscellaneous Options
We look at some more options for plotting, and we assume that you are familiar with the
basic plotting commands (Basic Plots). A variety of different subjects ranging from plotting
options to the formatting of plots is given.
In many of the examples below we use some of R's commands to generate random numbers
according to various distributions. The section is divided into three sections. The focus of the
first section is on graphing continuous data. The focus of the second section is on graphing
discrete data. The third section offers some miscellaneous options that are useful in a variety
of contexts.

6.1. Continuous Data
Contents
Multiple Data Sets on One Plot
Error Bars
Adding Noise (jitter)
Multiple Graphs on One Image
Density Plots
Pairwise Relationships
Shaded Regions
Plotting a Surface
In the examples below a data set is defined using R's normally distributed random number
generator.
> x <- rnorm(10,sd=5,mean=20)
> y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
> cor(x,y)
[1] 0.7400576

6.1.1. Multiple Data Sets on One Plot
One common task is to plot multiple data sets on the same plot. In many situations the way to
do this is to create the initial plot and then add additional information to the plot. For
example, to plot bivariate data the plot command is used to initialize and create the plot. The
points command can then be used to add additional data sets to the plot.
First define a set of normally distributed random numbers and then plot them. (This same
data set is used throughout the examples below.)
> x <- rnorm(10,sd=5,mean=20)
> y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
> cor(x,y)
[1] 0.7400576
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> x1 <- runif(8,15,25)
> y1 <- 2.5*x1 - 1.0 + runif(8,-6,6)
> points(x1,y1,col=2)
Note that in the previous example, the colour for the second set of data points is set using the
col option. You can try different numbers to see what colours are available. For most
installations there are at least eight options from 1 to 8. Also note that in the example above
the points are plotted as circles. The symbol that is used can be changed using the pch option.
> x2 <- runif(8,15,25)
> y2 <- 2.5*x2 - 1.0 + runif(8,-6,6)
> points(x2,y2,col=3,pch=2)
Again, try different numbers to see the various options. Another helpful option is to add a
legend. This can be done with the legend command. The options for the command, in order,
are the x and y coordinates on the plot to place the legend followed by a list of labels to use.
There are a large number of other options so use help(legend) to see more options. For
example a list of colors can be given with the col option, and a list of symbols can be given
with the pch option.
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> points(x1,y1,col=2,pch=3)
> points(x2,y2,col=4,pch=5)
> legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))

Figure 1.
The three data sets displayed on the same graph.
Another common task is to change the limits of the axes to change the size of the plotting
area. This is achieved using the xlim and ylim options in the plot command. Both options take
a vector of length two that have the minimum and maximum values.
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random
Stuff",xlim=c(0,30),ylim=c(0,100))
> points(x1,y1,col=2,pch=3)
> points(x2,y2,col=4,pch=5)
> legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))

6.1.2. Error Bars
Another common task is to add error bars to a set of data points. This can be accomplished
using the arrows command. The arrows command takes two pairs of coordinates, that is two
pairs of x and y values. The command then draws a line between each pair and adds an
arrow head with a given length and angle.
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> xHigh <- x
> yHigh <- y + abs(rnorm(10,sd=3.5))
> xLow <- x
> yLow <- y - abs(rnorm(10,sd=3.1))
> arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)

Figure 2.
A data set with error bars added.
Note that the option code is used to specify where the bars are drawn. Its value can be 1, 2, or
3. If code is 1 the bars are drawn at pairs given in the first argument. If code is 2 the bars are
drawn at the pairs given in the second argument. If code is 3 the bars are drawn at both.
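For example, a minimal sketch using made-up data (the variable names below are purely illustrative) draws the same bars twice so the effect of the code option is visible: with code=3 heads appear at both ends, while with code=2 they appear only at the second set of coordinates, the lower ends here.
> # Hypothetical data to compare code=3 and code=2 in the arrows command.
> xm <- 1:5
> ym <- c(2.1,3.0,2.7,3.8,3.3)
> err <- rep(0.4,5)
> plot(xm,ym,ylim=c(1,5))
> arrows(xm,ym+err,xm,ym-err,angle=90,length=0.1,code=3)        # heads at both ends
> arrows(xm+0.1,ym+err,xm+0.1,ym-err,angle=90,length=0.1,code=2,col=2)  # heads at the lower ends only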

6.1.3. Adding Noise (jitter)
In the previous example a little bit of noise was added to the pairs to produce an artificial
offset. This is a common thing to do for making plots. A simpler way to accomplish this is to
use the jitter command.
> numberWhite <- rhyper(400,4,5,3)
> numberChipped <- rhyper(400,2,7,3)
> par(mfrow=c(1,2))
> plot(numberWhite,numberChipped,xlab="Number White Marbles Drawn",
ylab="Number Chipped Marbles Drawn",main="Pulling Marbles")
> plot(jitter(numberWhite),jitter(numberChipped),
       xlab="Number White Marbles Drawn",
       ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")

Figure 3.
Points with noise added using the jitter command.

6.1.4. Multiple Graphs on One Image
Note that a new command was used in the previous example. The par command can be used
to set different parameters. In the example above the mfrow was set. The plots are arranged in
an array where the default number of rows and columns is one. The mfrow parameter is a
vector with two entries. The first entry is the number of rows of images. The second entry is
the number of columns. In the example above the plots were arranged in one row with two
plots across.
> par(mfrow=c(2,3))
> boxplot(numberWhite,main="first plot")
> boxplot(numberChipped,main="second plot")
> plot(jitter(numberWhite),jitter(numberChipped),
       xlab="Number White Marbles Drawn",
       ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
> hist(numberWhite,main="fourth plot")
> hist(numberChipped,main="fifth plot")
> mosaicplot(table(numberWhite,numberChipped),main="sixth plot")

Figure 4.
An array of plots using the par command.

6.1.5. Density Plots
There are times when you do not want to plot specific points but wish to plot a density. This
can be done using the smoothScatter command.
> numberWhite <- rhyper(30,4,5,3)
> numberChipped <- rhyper(30,2,7,3)
> smoothScatter(numberWhite,numberChipped,
       xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")

Figure 5.
The smoothScatter command can be used to plot densities.
Note that the previous example may benefit from superimposing a grid to help delimit the
points of interest. This can be done using the grid command.
> numberWhite <- rhyper(30,4,5,3)
> numberChipped <- rhyper(30,2,7,3)
> smoothScatter(numberWhite,numberChipped,
       xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")
> grid(4,3)

6.1.6. Pairwise Relationships
There are times when you want to explore a large number of relationships. A number of
relationships can be plotted at one time using the pairs command. The idea is that you give it
a matrix or a data frame, and the command will create a scatter plot of all combinations of the
data.
> uData <- rnorm(20)
> vData <- rnorm(20,mean=5)
> wData <- uData + 2*vData + rnorm(20,sd=0.5)
> xData <- -2*uData+rnorm(20,sd=0.1)
> yData <- 3*vData+rnorm(20,sd=2.5)
> d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
> pairs(d)

Figure 5.
Using pairs to produce all permutations of a set of relationships on one graph.

6.1.7. Shaded Regions
A shaded region can be plotted using the polygon command. The polygon command takes a
pair of vectors, x and y, and shades the region enclosed by the coordinate pairs. In the
example below a blue square is drawn. The vertices are defined starting from the lower left.
Five pairs of points are given because the starting point and the ending point are the same.
> x = c(-1,1,1,-1,-1)
> y = c(-1,-1,1,1,-1)
> plot(x,y)
> polygon(x,y,col='blue')
>
A more complicated example is given below. In this example the rejection region for a right
sided hypothesis test is plotted, and it is shaded in red. A set of custom axes is constructed,
and symbols are plotted using the expression command.
> stdDev <- 0.75;
> x <- seq(-5,5,by=0.01)
> y <- dnorm(x,sd=stdDev)
> right <- qnorm(0.95,sd=stdDev)
> plot(x,y,type="l",xaxt="n",ylab="p",
xlab=expression(paste('Assumed Distribution of ',bar(x))),
axes=FALSE,ylim=c(0,max(y)*1.05),xlim=c(min(x),max(x)),
frame.plot=FALSE)
> axis(1,at=c(-5,right,0,5),
       pos=c(0,0),
       labels=c(expression(' '),expression(bar(x)[cr]),
                expression(mu[0]),expression(' ')))
> axis(2)
> xReject <- seq(right,5,by=0.01)
> yReject <- dnorm(xReject,sd=stdDev)
> polygon(c(xReject,xReject[length(xReject)],xReject[1]),
c(yReject,0, 0), col='red')

Figure 6.
Using polygon to produce a shaded region.
The axes are drawn separately. This is done by first suppressing the plotting of the axes in the
plot command and then adding them with the axis command, with the horizontal axis positioned
so that it passes through the origin. Also note that the expression command is used to plot a
Greek character and to produce subscripts.

6.1.8. Plotting a Surface
Finally, a brief example of how to plot a surface is given. The persp command will plot a
surface with a specified perspective. In the example, a grid is defined by multiplying a row
and column vector to give the x and then the y values for a grid. Once that is done a sine
function is specified on the grid, and the persp command is used to plot it.
> x <- seq(0,2*pi,by=pi/100)
> y <- x
> xg <- (x*0+1) %*% t(y)
> yg <- (x) %*% t(y*0+1)
> f <- sin(xg+yg)
> persp(x,y,f,theta=-10,phi=40)
>
The %*% notation is used to perform matrix multiplication.
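If the matrix multiplication trick seems opaque, the same grids can also be built with the outer command; the following sketch, using the same x and y as above, should produce an identical surface.
> # An equivalent construction of the grids using outer (same x and y as above).
> xg2 <- outer(x*0+1, y)      # each row is a copy of y
> yg2 <- outer(x, y*0+1)      # each column is a copy of x
> f2 <- sin(xg2+yg2)
> persp(x,y,f2,theta=-10,phi=40)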

6.2. Discrete Data
Contents
Barplot
Mosaic Plot
In the examples below a data set is defined using R's hypergeometric random number
generator.
> numberWhite <- rhyper(30,4,5,3)
> numberChipped <- rhyper(30,2,7,3)
6.2.1. Barplot
The plot command will try to produce the appropriate plots based on the data type. The data
that is defined above, though, is numeric data. You need to convert the data to factors to
make sure that the plot command treats it in an appropriate way. The as.factor command is
used to cast the data as factors and ensures that R treats it as discrete data.
> numberWhite <- rhyper(30,4,5,3)
> numberWhite <- as.factor(numberWhite)
> plot(numberWhite)
>
In this case R will produce a barplot. The barplot command can also be used to create a
barplot. The barplot command requires a vector of heights, though, and you cannot simply
give it the raw data. The frequencies for the barplot command can be easily calculated using
the table command.
> numberWhite <- rhyper(30,4,5,3)
> totals <- table(numberWhite)
> totals
numberWhite
0 1 2 3
4 13 11 2
> barplot(totals,main="Number Draws",ylab="Frequency",xlab="Draws")
>
In the previous example the barplot command is used to set the title for the plot and the labels
for the axes. The labels on the ticks for the horizontal axis are automatically generated using
the labels on the table. You can change the labels by setting the row names of the table.
> totals <- table(numberWhite)
> rownames(totals) <- c("none","one","two","three")
> totals
numberWhite
none one two three
4 13 11 2
> barplot(totals,main="Number Draws",ylab="Frequency",xlab="Draws")
>
The order of the frequencies is the same as the order in the table. If you change the order in
the table it will change the way it appears in the barplot. For example, if you wish to arrange
the frequencies in descending order you can use the sort command with the decreasing option
set to TRUE.
> barplot(sort(totals,decreasing=TRUE),main="Number Draws",
          ylab="Frequency",xlab="Draws")
The indexing features of R can be used to change the order of the frequencies manually.
> totals
numberWhite
none one two three
4 13 11 2
> sort(totals,decreasing=TRUE)
numberWhite
one two none three
13 11 4 2
> totals[c(3,1,4,2)]
numberWhite
two none three one
11 4 2 13
> barplot(totals[c(3,1,4,2)])
>
The barplot command returns the horizontal locations of the bars. Using the locations and
putting together the previous ideas a Pareto Chart can be constructed.
> xLoc = barplot(sort(totals,decreasing=TRUE),main="Number Draws",
ylab="Frequency",xlab="Draws",ylim=c(0,sum(totals)+2))
> points(xLoc,cumsum(sort(totals,decreasing=TRUE)),type='p',col=2)
> points(xLoc,cumsum(sort(totals,decreasing=TRUE)),type='l')
>

6.2.2. Mosaic Plot
Mosaic plots are used to display proportions for tables that are divided into two or more
conditional distributions. Here we focus on two way tables to keep things simpler. It is
assumed that you are familiar with using tables in R (see the section on two way tables for
more information: Two Way Tables).
Here we will use a made-up data set, primarily to make it easier to figure out what R is doing.
The fictitious data set is defined below. The idea is that sixteen children of age eight are
interviewed. They are asked two questions. The first question is, "Do you believe in Santa
Claus?" If they say that they do then the term "belief" is recorded, otherwise the term "no
belief" is recorded. The second question is whether or not they have an older brother, older
sister, or no older sibling. (We are keeping it simple here!) The answers that are recorded are
"older brother," "older sister," or "no older sibling."
> santa <- data.frame(belief=c('no belief','no belief','no belief','no belief',
                               'belief','belief','belief','belief',
                               'belief','belief','no belief','no belief',
                               'belief','belief','no belief','no belief'),
                      sibling=c('older brother','older brother','older brother','older sister',
                                'no older sibling','no older sibling','no older sibling','older sister',
                                'older brother','older sister','older brother','older sister',
                                'no older sibling','older sister','older brother','no older sibling')
                      )
> santa
belief sibling
1 no belief older brother
2 no belief older brother
3 no belief older brother
4 no belief older sister
5 belief no older sibling
6 belief no older sibling
7 belief no older sibling
8 belief older sister
9 belief older brother
10 belief older sister
11 no belief older brother
12 no belief older sister
13 belief no older sibling
14 belief older sister
15 no belief older brother
16 no belief no older sibling
> summary(santa)
belief sibling
belief :8 no older sibling:5
no belief:8 older brother :6
older sister :5
The data is given as strings, so R will automatically treat them as categorical data, and the
data types are factors. If you plot the individual data sets, the plot command will default to
producing barplots.
> plot(santa$belief)
> plot(santa$sibling)
>
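One caveat: the automatic conversion relies on an older default. Since R 4.0.0 the data.frame command no longer converts character columns to factors, so on a recent installation you may need to cast the columns yourself before plotting. A minimal sketch is given below.
> # On R 4.0.0 and later data.frame() leaves strings as characters,
> # so cast the columns to factors explicitly before plotting.
> santa$belief <- as.factor(santa$belief)
> santa$sibling <- as.factor(santa$sibling)
> plot(santa$belief)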
If you provide both data sets it will automatically produce a mosaic plot which demonstrates
the relative frequencies in terms of the resulting areas.
> plot(santa$sibling,santa$belief)
> plot(santa$belief,santa$sibling)
The mosaicplot command can also be called directly:
> totals = table(santa$belief,santa$sibling)
> totals

no older sibling older brother older sister
belief 4 1 3
no belief 1 5 2
> mosaicplot(totals,main="Older Brothers are Jerks",
xlab="Belief in Santa Claus",ylab="Older Sibling")
The colours of the plot can be specified by setting the col argument. The argument is a vector
of colours used for the rows. See Figure 7 for an example.
> mosaicplot(totals,main="Older Brothers are Jerks",
xlab="Belief in Santa Claus",ylab="Older Sibling",
col=c(2,3,4))

Figure 7.
Example of a mosaic plot with colours.
The labels and the order that they appear in the plot can be changed in exactly the same way
as given in the examples for barplot above.
> rownames(totals)
[1] "belief" "no belief"
> colnames(totals)
[1] "no older sibling" "older brother" "older sister"
> rownames(totals) <- c("Believes","Does not Believe")
> colnames(totals) <- c("No Older","Older Brother","Older Sister")
> totals

No Older Older Brother Older Sister
Believes 4 1 3
Does not Believe 1 5 2
> mosaicplot(totals,main="Older Brothers are Jerks",
xlab="Belief in Santa Claus",ylab="Older Sibling")
When changing the order keep in mind that the table is a two dimensional array. The indices
must include both rows and columns, and the transpose command (t) can be used to switch
how it is plotted with respect to the vertical and horizontal axes.
> totals

No Older Older Brother Older Sister
Believes 4 1 3
Does not Believe 1 5 2
> totals[c(2,1),c(2,3,1)]

Older Brother Older Sister No Older
Does not Believe 5 2 1
Believes 1 3 4
> mosaicplot(totals[c(2,1),c(2,3,1)],main="Older Brothers are Jerks",
xlab="Belief in Santa Claus",ylab="Older Sibling",col=c(2,3,4))
> mosaicplot(t(totals),main="Older Brothers are Jerks",
ylab="Belief in Santa Claus",xlab="Older Sibling",col=c(2,3))

6.3. Miscellaneous Options
Contents
Multiple Representations On One Plot
Multiple Windows
Print To A File
Annotation and Formatting
The previous examples only provide a slight hint at what is possible. Here we give some
examples that provide a demonstration of the way the different commands can be combined
and the options that allow them to be used together.

6.3.1. Multiple Representations On One Plot
First, an example of a histogram with an approximation of the density function is given. In
addition to the density function a horizontal boxplot is added to the plot with a rug
representation of the data on the horizontal axis. The horizontal bounds on the histogram will
be specified. The boxplot must be added to the histogram, and it will be raised above the
histogram.
> x = rexp(20,rate=4)
> hist(x,ylim=c(0,18),main="This Are An Histogram",xlab="X")
> boxplot(x,at=16,horizontal=TRUE,add=TRUE)
> rug(x,side=1)
> d = density(x)
> points(d,type='l',col=3)
>
6.3.2. Multiple Windows
The dev commands allow you to create and manipulate multiple graphics windows. You can
create new windows using the dev.new() command, and you can choose which one to make
active using the dev.set() command. The dev.list(), dev.next(), and dev.prev() commands can
be used to list the available graphical devices and to step through them.
In the following example three devices are created. They are listed, and different plots are
created on the different devices.
> dev.new()
> dev.new()
> dev.new()
> dev.list()
X11cairo X11cairo X11cairo
2 3 4
> dev.set(3)
X11cairo
3
> x = rnorm(20)
> hist(x)
> dev.set(2)
X11cairo
2
> boxplot(x)
> dev.set(4)
X11cairo
4
> qqnorm(x)
> qqline(x)
> dev.next()
X11cairo
2
> dev.set(dev.next())
X11cairo
2
> plot(density(x))
>

6.3.3. Print To A File
There are a couple of ways to print a plot to a file. It is important to be able to work with
graphics devices as shown in the previous subsection (Multiple Windows). The first way
explored is to use the dev.print command. This command will print a copy of the currently
active device, and the format is defined by the device argument.
In the example below, the current window is printed to a png file called hist.png that is 200
pixels wide.
> x = rnorm(100)
> hist(x)
> dev.print(device=png,width=200,"hist.png")
>
To find out what devices are available on your system use the help command.
> help(device)
Another way to print to a file is to create a device in the same way as the graphical devices
were created in the previous section. Once the device is created, the various plot commands
are given, and then the device is turned off to write the results to a file.
> png(file="hist.png")
> hist(x)
> rug(x,side=1)
> dev.off()

6.3.4. Annotation and Formatting
Basic annotation can be performed in the regular plotting commands. For example, there
are options to specify labels on axes as well as titles. More options are available using the
axis command.
Most of the primary plotting commands have an option to turn off the generation of the axes
using the axes=FALSE option. The axes can be then added using the axis command which
allows for a greater number of options.
In the example below a bivariate set of random numbers is generated and plotted as a scatter
plot. The axes are added, but the horizontal axis is located in the center of the data rather than
at the bottom of the figure. Note that the horizontal and vertical axes are added separately,
and are specified using the first argument to the command. (Use help(axis) for a full list of
options.)
> x <- rnorm(10,mean=0,sd=4)
> y <- 3*x-1+rnorm(10,mean=0,sd=2)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.1550 -1.9280 1.2000 -0.1425 2.4780 3.1630
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-17.9800 -9.0060 0.7057 -1.2060 8.2600 10.9200
> plot(x,y,axes=FALSE,col=2)
> axis(1,pos=c(0,0),at=seq(-7,5,by=1))
> axis(2,pos=c(0,0),at=seq(-18,11,by=2))
>
In the previous example the at option is used to specify the tick marks.
When using the plot command the default behavior is to draw an axis as well as draw a box
around the plotting area. The drawing of the box can be suppressed using the bty option. The
value can be "o", "l", "7", "c", "u", "]", or "n". (The lines drawn roughly look like the letter
given, except for "n", which draws no lines.) The box can be drawn later using the box
command as well.
> x <- rnorm(10,mean=0,sd=4)
> y <- 3*x-1+rnorm(10,mean=0,sd=2)
> plot(x,y,bty="7")
> plot(x,y,bty="n")
> box(lty=3)
>
The par command can be used to set the default values for various parameters. A couple are
given below. In the example below the default background is set to gray, the box is reduced
to an "L" shape (left and bottom sides only), and the margins for the axes will be twice the
normal size.
> par(bty="l")
> par(bg="gray")
> par(mex=2)
> x <- rnorm(10,mean=0,sd=4)
> y <- 3*x-1+rnorm(10,mean=0,sd=2)
> plot(x,y)
>
Another common task is to place a text string on the plot. The text command takes a
coordinate and a label, and it places the label at the given coordinate. The text command has
options for setting the offset, size, font, and other properties. In the example below the label
"numbers!" is placed on the plot. Use help(text) to see more options.
> x <- rnorm(10,mean=0,sd=4)
> y <- 3*x-1+rnorm(10,mean=0,sd=2)
> plot(x,y)
> text(-1,-2,"numbers!")
>
The default text command will cut off any characters outside of the plot area. This behavior
can be overridden using the xpd option.
> x <- rnorm(10,mean=0,sd=4)
> y <- 3*x-1+rnorm(10,mean=0,sd=2)
> plot(x,y)
> text(-7,-2,"outside the area",xpd=TRUE)
>








7. Indexing Into Vectors
Contents
Indexing With Logicals
Not Available or Missing Values
Indices With Logical Expression
Given a vector of data one common task is to isolate particular entries or censor items that
meet some criteria. Here we show how to use R's indexing notation to pick out specific items
within a vector.

7.1. Indexing With Logicals
We first give an example of how to select specific items in a vector. The first step is to define
a vector of data, and the second step is to define a vector made up of logical values. When the
vector of logical values is used as the index into the vector of data values, only the items
corresponding to the entries that are TRUE are returned:
> a <- c(1,2,3,4,5)
> b <- c(TRUE,FALSE,FALSE,TRUE,FALSE)
> a[b]
[1] 1 4
> max(a[b])
[1] 4
> sum(a[b])
[1] 5

7.2. Not Available or Missing Values
One common problem is data entries that are marked NA or not available. There is a
predefined variable called NA that can be used to indicate missing information. The problem
with this is that some functions throw an error if one of the entries in the data is NA. Some
functions allow you to ignore the missing values through special options:
> a <- c(1,2,3,4,NA)
> a
[1] 1 2 3 4 NA
> sum(a)
[1] NA
> sum(a,na.rm=TRUE)
[1] 10
There are other times, though, when this option is not available, or you simply want to censor
them. The is.na function can be used to determine which items are not available. The logical
not operator in R is the ! symbol. When used with the indexing notation the items within a
vector that are NA can be easily removed:
> a <- c(1,2,3,4,NA)
> is.na(a)
[1] FALSE FALSE FALSE FALSE TRUE
> !is.na(a)
[1] TRUE TRUE TRUE TRUE FALSE
> a[!is.na(a)]
[1] 1 2 3 4
> b <- a[!is.na(a)]
> b
[1] 1 2 3 4

7.3. Indices With Logical Expression
Any logical expression can be used as an index, which opens up a wide range of possibilities.
You can remove or focus on entries that match specific criteria. For example, you
might want to remove all entries that are above a certain value:
> a = c(6,2,5,3,8,2)
> a
[1] 6 2 5 3 8 2
> b = a[a<6]
> b
[1] 2 5 3 2
For another example, suppose you want to join together the values that match two different
factors in another vector:
> d = data.frame(one=as.factor(c('a','a','b','b','c','c')),
two=c(1,2,3,4,5,6))
> d
one two
1 a 1
2 a 2
3 b 3
4 b 4
5 c 5
6 c 6
> both = d$two[(d$one=='a') | (d$one=='b')]
> both
[1] 1 2 3 4
Note that a single "|" was used in the previous example. There is a difference between "||" and
"|". A single bar will perform a vector operation, term by term, while a double bar will
evaluate to a single TRUE or FALSE result:
> (c(TRUE,TRUE))|(c(FALSE,TRUE))
[1] TRUE TRUE
> (c(TRUE,TRUE))||(c(FALSE,TRUE))
[1] TRUE
> (c(TRUE,TRUE))&(c(FALSE,TRUE))
[1] FALSE TRUE
> (c(TRUE,TRUE))&&(c(FALSE,TRUE))
[1] FALSE

8. Linear Least Squares Regression
Here we look at the most basic linear least squares regression. The main purpose is to provide
an example of the basic commands. It is assumed that you know how to enter data or read
data files which is covered in the first chapter, and it is assumed that you are familiar with the
different data types.
We will examine the interest rate for four year car loans, and the data that we use comes from
the U.S. Federal Reserve's mean rates. We are looking at and plotting means. This, of
course, is a very bad thing because it removes a lot of the variance and is misleading. The
only reason that we are working with the data in this way is to provide an example of linear
regression that does not use too many data points. Do not try this without a professional near
you, and if a professional is not near you do not tell anybody you did this. They will laugh at
you. People are mean, especially professionals.
The first thing to do is to specify the data. Here there are only five pairs of numbers so we
can enter them in manually. Each of the five pairs consists of a year and the mean interest
rate:
> year <- c(2000 , 2001 , 2002 , 2003 , 2004)
> rate <- c(9.34 , 8.50 , 7.62 , 6.93 , 6.60)
The next thing we do is take a look at the data. We first plot the data using a scatter plot and
notice that it looks linear. To confirm our suspicions we then find the correlation between the
year and the mean interest rates:
> plot(year,rate,
main="Commercial Banks Interest Rate for 4 Year Car Loan",
sub="http://www.federalreserve.gov/releases/g19/20050805/")
> cor(year,rate)
[1] -0.9880813
At this point we should be excited because associations that strong never happen in the real
world unless you cook the books or work with averaged data. The next question is what
straight line comes closest to the data? In this case we will use least squares regression as
one way to determine the line.
Before we can find the least square regression line we have to make some decisions. First we
have to decide which is the explanatory and which is the response variable. Here, we
arbitrarily pick the explanatory variable to be the year, and the response variable is the
interest rate. This was chosen because it seems like the interest rate might change in time
rather than time changing as the interest rate changes. (We could be wrong, finance is very
confusing.)
The command to perform the least square regression is the lm command. The command has
many options, but we will keep it simple and not explore them here. If you are interested use
the help(lm) command to learn more. Instead the only option we examine is the one necessary
argument which specifies the relationship.
Since we specified that the interest rate is the response variable and the year is the
explanatory variable this means that the regression line can be written in slope-intercept
form:
rate = (slope) * year + (intercept)
The way that this relationship is defined in the lm command is that you write the vector
containing the response variable, a tilde (~), and a vector containing the explanatory
variable:
> fit <- lm(rate ~ year)
> fit
Call:
lm(formula = rate ~ year)

Coefficients:
(Intercept) year
1419.208 -0.705
When you make the call to lm it returns a variable with a lot of information in it. If you are
just learning about least squares regression you are probably only interested in two things at
this point, the slope and the y-intercept. If you just type the name of the variable returned by
lm it will print out this minimal information to the screen. (See above.)
If you would like to know what else is stored in the variable you can use the attributes
command:
> attributes(fit)
$names
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"

$class
[1] "lm"
One of the things you should notice is the coefficients variable within fit. You can print out
the y-intercept and slope by accessing this part of the variable:
> fit$coefficients[1]
(Intercept)
1419.208
> fit$coefficients[[1]]
[1] 1419.208
> fit$coefficients[2]
year
-0.705
> fit$coefficients[[2]]
[1] -0.705
Note that if you just want to get the number you should use double square brackets. So if you want
to get an estimate of the interest rate in the year 2015 you can use the formula for a line:
> fit$coefficients[[2]]*2015+fit$coefficients[[1]]
[1] -1.367
So if you just wait long enough, the banks will pay you to take a car!
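As an aside, the same estimate can be obtained without writing out the line by hand by using the predict command with the fitted model; the sketch below assumes the fit variable defined above.
> # predict() evaluates the fitted line at new values of the explanatory variable.
> # This should reproduce the -1.367 estimate computed by hand above.
> predict(fit,newdata=data.frame(year=2015))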
A better use for this formula would be to calculate the residuals and plot them:
> res <- rate - (fit$coefficients[[2]]*year+fit$coefficients[[1]])
> res
[1] 0.132 -0.003 -0.178 -0.163 0.212
> plot(year,res)
That is a bit messy, but fortunately there are easier ways to get the residuals. Two other ways
are shown below:
> residuals(fit)
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212
> fit$residuals
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212
> plot(year,fit$residuals)
>
If you want to plot the regression line on the same plot as your scatter plot you can use the
abline function along with your variable fit:
> plot(year,rate,
main="Commercial Banks Interest Rate for 4 Year Car Loan",
sub="http://www.federalreserve.gov/releases/g19/20050805/")
> abline(fit)
Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an
F-test by asking R for a summary of the fit variable:
> summary(fit)

Call:
lm(formula = rate ~ year)

Residuals:
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1419.20800 126.94957 11.18 0.00153 **
year -0.70500 0.06341 -11.12 0.00156 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2005 on 3 degrees of freedom
Multiple R-Squared: 0.9763, Adjusted R-squared: 0.9684
F-statistic: 123.6 on 1 and 3 DF, p-value: 0.001559

9. Calculating Confidence Intervals
Contents
Calculating a Confidence Interval From a Normal Distribution
Calculating a Confidence Interval From a t Distribution
Calculating Many Confidence Intervals From a t Distribution
Here we look at some examples of calculating confidence intervals. The examples are for
both normal and t distributions. We assume that you can enter data and know the commands
associated with basic probability. Note that an easier way to calculate confidence intervals
using the t.test command is discussed in section The Easy Way.

9.1. Calculating a Confidence Interval From a Normal Distribution
Here we will look at a fictitious example. We will make some assumptions for what we might
find in an experiment and find the resulting confidence interval using a normal distribution.
Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is
20. In the example below we will use a 95% confidence level and wish to find the confidence
interval. The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523
We can be 95% confident that the true mean is in the interval between 4.12 and 5.88
assuming that the original random variable is normally distributed, and the samples are
independent.

9.2. Calculating a Confidence Interval From a t Distribution
Calculating the confidence interval when using a t-test is similar to using a normal
distribution. The only difference is that we use the command associated with the t-
distribution rather than the normal distribution. Here we repeat the procedures above, but we
will assume that we are working with a sample standard deviation rather than an exact
standard deviation.
Again we assume that the sample mean is 5, the sample standard deviation is 2, and the
sample size is 20. We use a 95% confidence level and wish to find the confidence interval.
The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029
We can be 95% confident that the true mean is in the interval between 4.06 and 5.94
assuming that the original random variable is normally distributed, and the samples are
independent.
We now look at an example where we have a univariate data set and want to find the 95%
confidence interval for the mean. In this example we use one of the data sets given in the data
input chapter. We use the w1.dat data set:
> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> summary(w1)
vals
Min. :0.130
1st Qu.:0.480
Median :0.720
Mean :0.765
3rd Qu.:1.008
Max. :1.760
> length(w1$vals)
[1] 54
> mean(w1$vals)
[1] 0.765
> sd(w1$vals)
[1] 0.3781222
We can now calculate an error for the mean:
> error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))
> error
[1] 0.1032075
The confidence interval is found by adding and subtracting the error from the mean:
> left <- mean(w1$vals)-error
> right <- mean(w1$vals)+error
> left
[1] 0.6617925
> right
[1] 0.8682075
We can be 95% confident that the true mean is between 0.66 and 0.87 assuming that the
original random variable is normally distributed, and the samples are independent.

9.3. Calculating Many Confidence Intervals From a t Distribution
Suppose that you want to find the confidence intervals for many tests. This is a common task
and most software packages will allow you to do this.
We have three different sets of results:
Comparison 1
Mean Std. Dev. Number (pop.)
Group I 10 3 300
Group II 10.5 2.5 230

Comparison 2
Mean Std. Dev. Number (pop.)
Group I 12 4 210
Group II 13 5.3 340

Comparison 3
Mean Std. Dev. Number (pop.)
Group I 30 4.5 420
Group II 28.5 3 400
For each of these comparisons we want to calculate the associated confidence interval for the
difference of the means. For each comparison there are two groups. We will refer to group
one as the group whose results are in the first row of each comparison above. We will refer to
group two as the group whose results are in the second row of each comparison above.
Before we can do that we must first compute the standard errors and margins of error. We will find
general formulae, which are necessary in order to do all three calculations at once.
We assume that the means for the first group are defined in a variable called m1. The means
for the second group are defined in a variable called m2. The standard deviations for the first
group are in a variable called sd1. The standard deviations for the second group are in a
variable called sd2. The number of samples for the first group are in a variable called num1.
Finally, the number of samples for the second group are in a variable called num2.
With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2.
The R commands to do this can be found below:
> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> error <- qt(0.975,df=pmin(num1,num2)-1)*se
To see the values just type in the variable name on a line alone:
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
> error
[1] 0.4711382 0.7856092 0.5227825
Now we need to define the confidence interval around the assumed differences. Just as in the
case of finding the p values (see the chapter on calculating p values) we have to use the pmin command to get the
number of degrees of freedom. In this case the null hypotheses are for a difference of zero,
and we use a 95% confidence interval:
> left <- (m1-m2)-error
> right <- (m1-m2)+error
> left
[1] -0.9711382 -1.7856092 0.9772175
> right
[1] -0.02886177 -0.21439076 2.02278249
This gives the confidence intervals for each of the three tests. For example, in the first
experiment the 95% confidence interval is between -0.97 and -0.03 assuming that the random
variables are normally distributed, and the samples are independent.



10. Calculating p Values
Contents
Calculating a Single p Value From a Normal Distribution
Calculating a Single p Value From a t Distribution
Calculating Many p Values From a t Distribution
The Easy Way
Here we look at some examples of calculating p values. The examples are for both normal
and t distributions. We assume that you can enter data and know the commands associated
with basic probability. We first show how to do the calculations the hard way and show how
to do the calculations. The last method makes use of the t.test command and demonstrates an
easier way to calculate a p value.

10.1. Calculating a Single p Value From a Normal Distribution
We look at the steps necessary to calculate the p value for a particular test. In the interest of
simplicity we only look at a two sided test, and we focus on one example. Here we want to
show that the mean is not close to a fixed value, a.
Ho: μx = a,
Ha: μx ≠ a.
The p value is calculated for a particular sample mean. Here we assume that we obtained a
sample mean, x̄, and want to find its p value. It is the probability that we would obtain a
sample mean whose Z-score is greater than the absolute value of our Z-score or less than the
negative of the absolute value of our Z-score.
For the special case of a normal distribution we also need the standard deviation. We will
assume that we are given the standard deviation and call it s. The calculation for the p value
can be done in several of ways. We will look at two ways here. The first way is to convert the
sample means to their associated Z-score. The other way is to simply specify the standard
deviation and let the computer do the conversion. At first glance it may seem like a
no-brainer, and we should just use the second method. Unfortunately, when using the t-
distribution we need to convert to the t-score, so it is a good idea to know both ways.
We first look at how to calculate the p value using the Z-score. The Z-score is found by
assuming that the null hypothesis is true, subtracting the assumed mean, and dividing by the
theoretical standard deviation. Once the Z-score is found the probability that the value could
be less the Z-score is found using the pnorm command.
This is not enough to get the p value. If the Z-score that is found is positive then we need to
take one minus the associated probability. Also, for a two sided test we need to multiply the
result by two. Here we avoid these issues and ensure that the Z-score is negative by taking the
negative of the absolute value.
We now look at a specific example. In the example below we will use a value of a of 5, a
standard deviation of 2, and a sample size of 20. We then find the p value for a sample mean
of 7:
> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> z <- (xbar-a)/(s/sqrt(n))
> z
[1] 4.472136
> 2*pnorm(-abs(z))
[1] 7.744216e-06
We now look at the same problem only specifying the mean and standard deviation within
the pnorm command. Note that for this case we cannot so easily force the use of the left tail.
Since the sample mean is more than the assumed mean we have to take two times one minus
the probability:
> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> 2*(1-pnorm(xbar,mean=a,sd=s/sqrt(20)))
[1] 7.744216e-06

10.2. Calculating a Single p Value From a t Distribution
Finding the p value using a t distribution is very similar to using the Z-score as demonstrated
above. The only difference is that you have to specify the number of degrees of freedom.
Here we look at the same example as above but use the t distribution instead:
> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> t <- (xbar-a)/(s/sqrt(n))
> t
[1] 4.472136
> 2*pt(-abs(t),df=n-1)
[1] 0.0002611934
We now look at an example where we have a univariate data set and want to find the p value.
In this example we use one of the data sets given in the data input chapter. We use the w1.dat
data set:
> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> summary(w1)
vals
Min. :0.130
1st Qu.:0.480
Median :0.720
Mean :0.765
3rd Qu.:1.008
Max. :1.760
> length(w1$vals)
[1] 54
Here we use a two sided hypothesis test,
Ho: μx = 0.7,
Ha: μx ≠ 0.7.
So we calculate the sample mean and sample standard deviation in order to calculate the p
value:
> t <- (mean(w1$vals)-0.7)/(sd(w1$vals)/sqrt(length(w1$vals)))
> t
[1] 1.263217
> 2*pt(-abs(t),df=length(w1$vals)-1)
[1] 0.21204

10.3. Calculating Many p Values From a t Distribution
Suppose that you want to find the p values for many tests. This is a common task and most
software packages will allow you to do this. Here we see how it can be done in R.
Here we assume that we want to do a one-sided hypothesis test for a number of comparisons.
In particular we will look at three hypothesis tests. All are of the following form:
Ho: μ1 - μ2 = 0,
Ha: μ1 - μ2 ≠ 0.
We have three different sets of comparisons to make:
Comparison 1
Mean Std. Dev. Number (pop.)
Group I 10 3 300
Group II 10.5 2.5 230

Comparison 2
Mean Std. Dev. Number (pop.)
Group I 12 4 210
Group II 13 5.3 340

Comparison 3
Mean Std. Dev. Number (pop.)
Group I 30 4.5 420
Group II 28.5 3 400
For each of these comparisons we want to calculate a p value. For each comparison there are
two groups. We will refer to group one as the group whose results are in the first row of each
comparison above. We will refer to group two as the group whose results are in the second
row of each comparison above. Before we can do that we must first compute a standard error
and a t-score. We will find general formulae, which are necessary in order to do all three
calculations at once.
We assume that the means for the first group are defined in a variable called m1. The means
for the second group are defined in a variable called m2. The standard deviations for the first
group are in a variable called sd1. The standard deviations for the second group are in a
variable called sd2. The number of samples for the first group are in a variable called num1.
Finally, the number of samples for the second group are in a variable called num2.
With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2.
The associated t-score is m1 minus m2, all divided by the standard error. The R commands to do
this can be found below:
> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> t <- (m1-m2)/se
To see the values just type in the variable name on a line alone:
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
> t
[1] -2.091082 -2.509364 5.640761
To use the pt command we need to specify the number of degrees of freedom. This can be
done using the pmin command. Note that there is also a command called min, but it does not
work the same way. You need to use pmin to get the correct results. The numbers of degrees
of freedom are pmin(num1,num2)-1. So the p values can be found using the following R
command:
> pt(t,df=pmin(num1,num2)-1)
[1] 0.01881168 0.00642689 0.99999998
If you enter all of these commands into R you should have noticed that the last p value is not
correct. The pt command gives the probability that a score is less than the specified t. The t-
score for the last entry is positive, and we want the probability that a t-score is bigger. One
way around this is to make sure that all of the t-scores are negative. You can do this by taking
the negative of the absolute value of the t-scores:
> pt(-abs(t),df=pmin(num1,num2)-1)
[1] 1.881168e-02 6.426890e-03 1.605968e-08
The results from the command above should give you the p values for a one-sided test. It is
left as an exercise how to find the p values for a two-sided test.
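As a hint for the exercise (a sketch only): for a two-sided test the one-sided tail probabilities are doubled, and the same pmin construction can be reused.
> # Sketch for the two sided version: double the one sided tail probabilities.
> 2*pt(-abs(t),df=pmin(num1,num2)-1)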

10.4. The Easy Way
The methods above demonstrate how to calculate the p values directly making use of the
standard formulae. There is another, more direct way to do this using the t.test command. The
t.test command takes a data set for an argument, and the default operation is to perform a two
sided hypothesis test.
> x = c(9.0,9.5,9.6,10.2,11.6)
> t.test(x)

One Sample t-test

data: x
t = 22.2937, df = 4, p-value = 2.397e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
8.737095 11.222905
sample estimates:
mean of x
9.98

> help(t.test)
>
That was an obvious result. If you want to test against a different assumed mean then you can
use the mu argument:
> x = c(9.0,9.5,9.6,10.2,11.6)
> t.test(x,mu=10)

One Sample t-test

data: x
t = -0.0447, df = 4, p-value = 0.9665
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
8.737095 11.222905
sample estimates:
mean of x
9.98
If you are interested in a one sided test then you can specify which test to employ using the
alternative option:
> x = c(9.0,9.5,9.6,10.2,11.6)
> t.test(x,mu=10,alternative="less")

One Sample t-test

data: x
t = -0.0447, df = 4, p-value = 0.4833
alternative hypothesis: true mean is less than 10
95 percent confidence interval:
-Inf 10.93434
sample estimates:
mean of x
9.98
The t.test() command also accepts a second data set to compare two sets of samples. The
default is to treat them as independent sets, but there is an option to treat them as dependent
data sets. (Enter help(t.test) for more information.) To test two different samples, the first two
arguments should be the data sets to compare:
> x = c(9.0,9.5,9.6,10.2,11.6)
> y=c(9.9,8.7,9.8,10.5,8.9,8.3,9.8,9.0)
> t.test(x,y)
Welch Two Sample t-test

data: x and y
t = 1.1891, df = 6.78, p-value = 0.2744
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6185513 1.8535513
sample estimates:
mean of x mean of y
9.9800 9.3625
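As a brief sketch of the dependent (paired) case, note that paired samples must have the same number of observations; the vectors below are made up purely for illustration.
> # Hypothetical paired data: with paired=TRUE the two vectors must be the same length.
> before <- c(9.0,9.5,9.6,10.2,11.6)
> after <- c(9.9,8.7,9.8,10.5,8.9)
> t.test(before,after,paired=TRUE)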




11. Calculating The Power Of A Test
Contents
Calculating The Power Using a Normal Distribution
Calculating The Power Using a t Distribution
Calculating Many Powers From a t Distribution
Here we look at some examples of calculating the power of a test. The examples are for both
normal and t distributions. We assume that you can enter data and know the commands
associated with basic probability. All of the examples here are for a two sided test, and you
can adjust them accordingly for a one sided test.

11.1. Calculating The Power Using a Normal Distribution
Here we calculate the power of a test for a normal distribution for a specific example.
Suppose that our hypothesis test is the following:
Ho: μx = a,
Ha: μx ≠ a.
The power of a test is the probability that we can reject the null hypothesis at a given mean
that is away from the one specified in the null hypothesis. We calculate this probability by
first calculating the probability that we accept the null hypothesis when we should not. This is
the probability of making a type II error. The power is the probability that we do not make a
type II error so we then take one minus the result to get the power.
We can fail to reject the null hypothesis if the sample mean happens to be within the confidence
interval we find when we assume that the null hypothesis is true. To get the confidence
interval we find the margin of error and then add and subtract it to the proposed mean, a, to
get the confidence interval. We then turn around and assume instead that the true mean is at a
different, explicitly specified level, and then find the probability a sample could be found
within the original confidence interval.
In the example below the hypothesis test is for
Ho: μx = 5,
Ha: μx ≠ 5,
We will assume that the standard deviation is 2, and the sample size is 20. In the example
below we will use a 95% confidence level and wish to find the power to detect a true mean
that differs from 5 by an amount of 1.5. (All of these numbers are made up solely for this
example.) The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523
Next we find the Z-scores for the left and right values assuming that the true mean is
5+1.5=6.5:
> assumed <- a + 1.5
> Zleft <- (left-assumed)/(s/sqrt(n))
> Zright <-(right-assumed)/(s/sqrt(n))
> p <- pnorm(Zright)-pnorm(Zleft)
> p
[1] 0.08163792
The probability that we make a type II error if the true mean is 6.5 is approximately 8.2%. So
the power of the test is 1-p:
> 1-p
[1] 0.918362
In this example, the power of the test is approximately 91.8%. If the true mean differs from 5
by 1.5 then the probability that we will reject the null hypothesis is approximately 91.8%.

11.2. Calculating The Power Using a t Distribution
Calculating the power when using a t-test is similar to using a normal distribution. One
difference is that we use the command associated with the t-distribution rather than the
normal distribution. Here we repeat the test above, but we will assume that we are working
with a sample standard deviation rather than an exact standard deviation. We will explore
three different ways to calculate the power of a test. The first method makes use of the
scheme many books recommend if you do not have the non-central distribution available.
The second does make use of the non-central distribution, and the third makes use of a single
command that will do a lot of the work for us.
In the example the hypothesis test is the same as above,
Ho: μx = 5,
Ha: μx ≠ 5,
Again we assume that the sample standard deviation is 2, and the sample size is 20. We use a
95% confidence level and wish to find the power to detect a true mean that differs from 5 by
an amount of 1.5. The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029
The number of observations is large enough that the results are quite close to those in the
example using the normal distribution. Next we find the t-scores for the left and right values
assuming that the true mean is 5+1.5=6.5:
> assumed <- a + 1.5
> tleft <- (left-assumed)/(s/sqrt(n))
> tright <- (right-assumed)/(s/sqrt(n))
> p <- pt(tright,df=n-1)-pt(tleft,df=n-1)
> p
[1] 0.1112583
The probability that we make a type II error if the true mean is 6.5 is approximately 11.1%.
So the power of the test is 1-p:
> 1-p
[1] 0.8887417
In this example, the power of the test is approximately 88.9%. If the true mean differs from 5
by 1.5 then the probability that we will reject the null hypothesis is approximately 88.9%.
Note that the power calculated for a normal distribution is slightly higher than for this one
calculated with the t-distribution.
Another way to approximate the power is to make use of the non-centrality parameter. The
idea is that you give it the critical t scores and the amount that the mean would be shifted if
the alternate mean were the true mean. This is the method that most books recommend.
> ncp <- 1.5/(s/sqrt(n))
> t <- qt(0.975,df=n-1)
> pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp)
[1] 0.1111522
> 1-(pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp))
[1] 0.8888478
Again, we see that the probability of making a type II error is approximately 11.1%, and the
power is approximately 88.9%. Note that this is slightly different than the previous
calculation but is still close.
Finally, there is one more command that we explore. This command allows us to do the same
power calculation as above but with a single command.
> power.t.test(n=n,delta=1.5,sd=s,sig.level=0.05,
type="one.sample",alternative="two.sided",strict = TRUE)

One-sample t test power calculation

n = 20
delta = 1.5
sd = 2
sig.level = 0.05
power = 0.8888478
alternative = two.sided
This is a powerful command that can do much more than just calculate the power of a test.
For example it can also be used to calculate the number of observations necessary to achieve
a given power. For more information check out the help page, help(power.t.test).
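For example, a minimal sketch of solving for the sample size: omit n and supply the desired power (the 90% target below is made up purely for illustration), and the command should report the number of observations required.
> # Ask power.t.test for the sample size needed to reach 90% power
> # with the same delta and standard deviation as above.
> power.t.test(power=0.90,delta=1.5,sd=s,sig.level=0.05,
       type="one.sample",alternative="two.sided")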

11.3. Calculating Many Powers From a t Distribution
Suppose that you want to find the powers for many tests. This is a common task and most
software packages will allow you to do this. Here we see how it can be done in R. We use the
exact same cases as in the previous chapter.
Here we assume that we want to do a two-sided hypothesis test for a number of comparisons
and want to find the power of the tests to detect a 1 point difference in the means. In
particular we will look at three hypothesis tests. All are of the following form:
Ho: μ1 - μ2 = 0,
Ha: μ1 - μ2 ≠ 0,
We have three different sets of comparisons to make:
Comparison 1
Mean Std. Dev. Number (pop.)
Group I 10 3 300
Group II 10.5 2.5 230

Comparison 2
Mean Std. Dev. Number (pop.)
Group I 12 4 210
Group II 13 5.3 340

Comparison 3
Mean Std. Dev. Number (pop.)
Group I 30 4.5 420
Group II 28.5 3 400
For each of these comparisons we want to calculate the power of the test. For each
comparison there are two groups. We will refer to group one as the group whose results are in
the first row of each comparison above. We will refer to group two as the group whose results
are in the second row of each comparison above. Before we can do that we must first
compute a standard error and a t-score. We will find general formulae, which are necessary in
order to do all three calculations at once.
We assume that the means for the first group are defined in a variable called m1. The means
for the second group are defined in a variable called m2. The standard deviations for the first
group are in a variable called sd1. The standard deviations for the second group are in a
variable called sd2. The number of samples for the first group are in a variable called num1.
Finally, the number of samples for the second group are in a variable called num2.
With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2.
The R commands to do this can be found below:
> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
To see the values just type in the variable name on a line alone:
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
Now we need to define the confidence interval around the assumed differences. Just as in the
case of finding the p values in the previous chapter we have to use the pmin command to get the
number of degrees of freedom. In this case the null hypotheses are for a difference of zero,
and we use a 95% confidence interval:
> left <- qt(0.025,df=pmin(num1,num2)-1)*se
> right <- -left
> left
[1] -0.4711382 -0.7856092 -0.5227825
> right
[1] 0.4711382 0.7856092 0.5227825
We can now calculate the power of the test. Assuming a true difference of 1 we can
calculate the t-scores associated with both the left and right variables:
> tl <- (left-1)/se
> tr <- (right-1)/se
> tl
[1] -6.152541 -4.480743 -5.726434
> tr
[1] -2.2117865 -0.5379844 -1.7945799
> probII <- pt(tr,df=pmin(num1,num2)-1) -
pt(tl,df=pmin(num1,num2)-1)
> probII
[1] 0.01398479 0.29557399 0.03673874
> power <- 1-probII
> power
[1] 0.9860152 0.7044260 0.9632613
The results from the commands above should give you the powers for a two-sided test. It is
left as an exercise how to find the powers for a one-sided test.
Just as was found above there is more than one way to calculate the power. We also include
the method using the non-centrality parameter, which is recommended over the previous
method:
> t <- qt(0.975,df=pmin(num1,num2)-1)
> t
[1] 1.970377 1.971379 1.965927
> ncp <- (1)/se
> pt(t,df=pmin(num1,num2)-1,ncp=ncp)-pt(-t,df=pmin(num1,num2)-1,ncp=ncp)
[1] 0.01374112 0.29533455 0.03660842
> 1-(pt(t,df=pmin(num1,num2)-1,ncp=ncp)-pt(-t,df=pmin(num1,num2)-1,ncp=ncp))
[1] 0.9862589 0.7046655 0.9633916





12. Two Way Tables
Contents
Creating a Table from Data
Creating a Table Directly
Tools For Working With Tables
Graphical Views of Tables
Here we look at some examples of how to work with two way tables. We assume that you
can enter data and understand the different data types.

12.1. Creating a Table from Data
We first look at how to create a table from raw data. Here we use a fictitious data set,
smoker.csv. This data set was created only to be used as an example, and the numbers were
created to match an example from a textbook, p. 629 of the 4th edition of Moore and
McCabe's Introduction to the Practice of Statistics. You should look at the data set in a
spreadsheet to see how it is entered. The information is ordered in a way to make it easier to
figure out what information is in the data.
The idea is that 356 people have been polled on their smoking status (Smoke) and their
socioeconomic status (SES). For each person it was determined whether or not they are
current smokers, former smokers, or have never smoked. Also, for each person their
socioeconomic status was determined (low, middle, or high). The data file contains only two
columns, and when read R interprets them both as factors:
> smokerData <- read.csv(file='smoker.csv',sep=',',header=T)
> summary(smokerData)
Smoke SES
current:116 High :211
former :141 Low : 93
never : 99 Middle: 52
>
You can create a two way table of occurrences using the table command and the two columns
in the data frame:
> smoke <- table(smokerData$Smoke,smokerData$SES)
> smoke

High Low Middle
current 51 43 22
former 92 28 21
never 68 22 9
In this example, there are 51 people who are current smokers and are in the high SES. Note
that it is assumed that the two lists given in the table command are both factors. (More
information on this is available in the chapter on data types.)

12.2. Creating a Table Directly
Sometimes you are given data already summarized in the form of a table and would like to create
the corresponding table object in R. Here we examine how to create the table directly.
Unfortunately, this is not as direct a method as
might be desired. Here we create an array of numbers, specify the row and column names,
and then convert it to a table.
In the example below we will create a table identical to the one given above. In that example
we have 3 columns, and the numbers are specified by going across each row from top to
bottom. We need to specify the data and the number of rows:
> smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) <- c("High","Low","Middle")
> rownames(smoke) <- c("current","former","never")
> smoke <- as.table(smoke)
> smoke
High Low Middle
current 51 43 22
former 92 28 21
never 68 22 9

12.3. Tools For Working With Tables
Here we look at some of the commands available to help look at the information in a table in
different ways. We assume that the data has been entered using one of the methods above, and the table is
called smoke. First, there are a couple of ways to get graphical views of the data:
> barplot(smoke,legend=T,beside=T,main='Smoking Status by SES')
> plot(smoke,main="Smoking Status By Socioeconomic Status")
There are a number of ways to get the marginal distributions using the margin.table
command. If you just give the command the table it calculates the total number of
observations. You can also calculate the marginal distributions across the rows or columns
based on the one optional argument:
> margin.table(smoke)
[1] 356
> margin.table(smoke,1)

current former never
116 141 99
> margin.table(smoke,2)

High Low Middle
211 93 52
Combining these commands you can get the proportions:
> smoke/margin.table(smoke)

High Low Middle
current 0.14325843 0.12078652 0.06179775
former 0.25842697 0.07865169 0.05898876
never 0.19101124 0.06179775 0.02528090
> margin.table(smoke,1)/margin.table(smoke)

current former never
0.3258427 0.3960674 0.2780899
> margin.table(smoke,2)/margin.table(smoke)

High Low Middle
0.5926966 0.2612360 0.1460674
That is a little obtuse, so fortunately, there is a better way to get the proportions using the
prop.table command. You can specify the proportions with respect to the different marginal
distributions using the optional argument:
> prop.table(smoke)

High Low Middle
current 0.14325843 0.12078652 0.06179775
former 0.25842697 0.07865169 0.05898876
never 0.19101124 0.06179775 0.02528090
> prop.table(smoke,1)

High Low Middle
current 0.4396552 0.3706897 0.1896552
former 0.6524823 0.1985816 0.1489362
never 0.6868687 0.2222222 0.0909091
> prop.table(smoke,2)

High Low Middle
current 0.2417062 0.4623656 0.4230769
former 0.4360190 0.3010753 0.4038462
never 0.3222749 0.2365591 0.1730769
If you want to do a chi-squared test to determine if the proportions are different, there is an
easy way to do this. If we want to test at the 95% confidence level we need only look at a
summary of the table:
> summary(smoke)
Number of cases in table: 356
Number of factors: 2
Test for independence of all factors:
Chisq = 18.51, df = 4, p-value = 0.0009808
Since the p-value is less than 5% we can reject the null hypothesis at the 95% confidence
level and can say that the proportions vary.
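The summary method above is a convenient shortcut. As a sketch of an alternative, base R also
provides the chisq.test command, which performs the same Pearson chi-squared test when it is
given a table; the statistic, degrees of freedom, and p-value should agree with the summary
above:
> chisq.test(smoke)

        Pearson's Chi-squared test

data:  smoke
X-squared = 18.51, df = 4, p-value = 0.0009808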
Of course, there is a hard way to do this. This is not for the faint of heart and involves some
linear algebra which we will not describe. If you wish to calculate the table of expected
values then you need to multiply the vectors of the margins and divide by the total number of
observations:
> expected <- as.array(margin.table(smoke,1)) %*%
t(as.array(margin.table(smoke,2))) / margin.table(smoke)
> expected

High Low Middle
current 68.75281 30.30337 16.94382
former 83.57022 36.83427 20.59551
never 58.67697 25.86236 14.46067
(The t function takes the transpose of the array.)
The result is an array that can be compared directly to the existing table. We need the square
of the difference between the two tables divided by the expected values. The sum of all these
values is the chi-squared statistic:
> chi <- sum((expected - as.array(smoke))^2/expected)
> chi
[1] 18.50974
We can then get the p-value for this statistic:
> 1-pchisq(chi,df=4)
[1] 0.0009808236

12.4. Graphical Views of Tables
The plot command will automatically produce a mosaic plot if its primary argument is a
table. Alternatively, you can call the mosaicplot command directly.
> smokerData <- read.csv(file='smoker.csv',sep=',',header=T)
> smoke <- table(smokerData$Smoke,smokerData$SES)
> mosaicplot(smoke)
> help(mosaicplot)
>
The mosaicplot command takes many of the same arguments for annotating a plot:
> mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
>
If you wish to switch which side (horizontal versus vertical) to determine the primary
proportion then you can use the sort option. This can be used to switch whether the width or
height is used for the first proportional length:
> mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
> mosaicplot(smoke,sort=c(2,1))
>
Finally, if you wish to switch which side is used for the vertical and horizontal axes you can use
the dir option:
> mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
> mosaicplot(smoke,dir=c("v","h"))
>

13. Data Management
Contents
Appending Data
Applying Functions Across Data Elements
Here we look at some common tasks that come up when dealing with data. These tasks range
from assembling different data sets into more convenient forms to applying functions to
different parts of the data sets. The topics in this section demonstrate some of the power of
R, but it may not be clear at first. The functions are commonly used in a wide variety of
circumstances for a number of different reasons. These tools have saved me a great deal of
time and effort in circumstances that I would not have predicted in advance.
The important thing to note, though, is that this section is called Data Management. It is
not called Data Manipulation. Politicians manipulate data; we manage them.

13.1. Appending Data
When you have more than one set of data you may want to bring them together. You can
bring different data sets together by appending as rows (rbind) or by appending as columns
(cbind). The first example shows how this is done with two data frames. The arguments to the
functions can take any number of objects. We only use two here to keep the demonstration
simpler, but additional data frames can be appended in the same call. It is important to note
that when you bring things together as rows the names of the objects within the data frame
must be the same.
> a <- data.frame(one=c( 0, 1, 2),two=c("a","a","b"))
> b <- data.frame(one=c(10,11,12),two=c("c","c","d"))
> a
one two
1 0 a
2 1 a
3 2 b
> b
one two
1 10 c
2 11 c
3 12 d
> v <- rbind(a,b)
> typeof(v)
[1] "list"
> v
one two
1 0 a
2 1 a
3 2 b
4 10 c
5 11 c
6 12 d
> w <- cbind(a,b)
> typeof(w)
[1] "list"
> w
one two one two
1 0 a 10 c
2 1 a 11 c
3 2 b 12 d
> names(w) = c("one","two","three","four")
> w
one two three four
1 0 a 10 c
2 1 a 11 c
3 2 b 12 d
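As mentioned above, the call is not limited to two objects. As a quick sketch using the same two
data frames already defined, any number of frames with matching column names can be stacked
in a single call:
> rbind(a,b,a)
one two
1 0 a
2 1 a
3 2 b
4 10 c
5 11 c
6 12 d
7 0 a
8 1 a
9 2 b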
The same commands also work with vectors and matrices and behave in a similar manner.
> A = matrix(c( 1, 2, 3, 4, 5, 6),ncol=3,byrow=TRUE)
> A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
> B = matrix(c(10,20,30,40,50,60),ncol=3,byrow=TRUE)
> B
[,1] [,2] [,3]
[1,] 10 20 30
[2,] 40 50 60
> V <- rbind(A,B)
> typeof(V)
[1] "double"
> V
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 10 20 30
[4,] 40 50 60
> W <- cbind(A,B)
> typeof(W)
[1] "double"
> W
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 10 20 30
[2,] 4 5 6 40 50 60

13.2. Applying Functions Across Data Elements
The various apply functions can be an invaluable tool when trying to work with subsets
within a data set. The different versions of the apply commands are used to take a function
and have the function perform an operation on each part of the data. There are a wide variety
of these commands, but we only look at two sets of them. The first set, lapply and sapply, is
used to apply a function to every element in a list. The second one, tapply, is used to apply a
function on each set broken up by a given set of factors.

13.2.1. Operations on Lists and Vectors
First, the lapply command is used to take a list of items and perform some function on each
member of the list. That is, the list includes a number of different objects. You want to
perform some operation on every object within the list. You can use lapply to tell R to go
through each item in the list and perform the desired action on each item.
In the following example a list is created with three elements. The first is a randomly
generated set of numbers with a normal distribution. The second is a randomly generated set
of numbers with an exponential distribution. The last is a set of factors. A summary is then
performed on each element in the list.
> x <- list(a=rnorm(200,mean=1,sd=10),
b=rexp(300,10.0),
c=as.factor(c("a","b","b","b","c","c")))
> lapply(x,summary)
$a
Min. 1st Qu. Median Mean 3rd Qu. Max.
-26.65000 -6.91200 -0.39250 0.09478 6.86700 32.00000

$b
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0001497 0.0242300 0.0633300 0.0895400 0.1266000 0.7160000

$c
a b c
1 3 2
The lapply command returns a list. The entries in the list have the same names as the entries
in the list that is passed to it. The values of each entry are the results from applying the
function. The sapply function is similar, but the difference is that it tries to turn the result into
a vector or matrix if possible. If it does not make sense then it returns a list just like the lapply
command.
> x <- list(a=rnorm(8,mean=1,sd=10),b=rexp(10,10.0))
> x
$a
[1] -0.3881426 6.2910959 13.0265859 -1.5296377 6.9285984 -28.3050569
[7] 11.9119731 -7.6036997

$b
[1] 0.212689007 0.081818395 0.222462531 0.181424705 0.168476454 0.002924134
[7] 0.007010114 0.016301837 0.081291728 0.055426055

> val <- lapply(x,mean)
> typeof(val)
[1] "list"
> val
$a
[1] 0.04146456

$b
[1] 0.1029825

> val$a
[1] 0.04146456
> val$b
[1] 0.1029825
>
>
> other <- sapply(x,mean)
> typeof(other)
[1] "double"
> other
a b
0.04146456 0.10298250
> other[1]
a
0.04146456
> other[2]
b
0.1029825

13.2.2. Operations By Factors
Another widely used variant of the apply functions is the tapply function. The tapply function
will take a list of data, usually a vector, a list of factors of the same length, and a function. It will
then apply the function to each subset of the data that corresponds to each level of the factors.
> val <- data.frame(a=c(1,2,10,20,5,50),
b=as.factor(c("a","a","b","b","a","b")))
> val
a b
1 1 a
2 2 a
3 10 b
4 20 b
5 5 a
6 50 b
> result <- tapply(val$a,val$b,mean)
> typeof(result)
[1] "double"
> result
a b
2.666667 26.666667
> result[1]
a
2.666667
> result[2]
b
26.66667
> result <- tapply(val$a,val$b,summary)
> typeof(result)
[1] "list"
> result
$a
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.500 2.000 2.667 3.500 5.000

$b
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 15.00 20.00 26.67 35.00 50.00

> result$a
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.500 2.000 2.667 3.500 5.000
> result$b
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 15.00 20.00 26.67 35.00 50.00
14. Scripting
Contents
Executing a file
Functions
We look at running commands from a source file. We also include an overview of the
different statements that are used for control-flow that determines which code is executed by
the interpreter.

14.1. Executing a file
Contents
Executing the commands in a File
if statements
for statements
while statements
repeat statements
break and next statements
switch statement
scan statement
In the next section the ways to execute the commands in a file using the source command are
given. The remaining sections are used to list the various flow control options that are
available in the R language definition. The language definition has a wide variety of control
functions which can be found using the help command.
> help(Control)
>

14.1.1. Executing the commands in a File
A set of R commands can be saved in a file and then executed as if you had typed them in
from the command line. The source command is used to read the file and execute the
commands in the same sequence given in the file.
> source('file.R')
> help(source)
>
If you simply source the file the commands are not printed, and the results of commands are
not printed. This can be overridden using the echo, print.eval, and verbose options.
Some examples are given assuming that a file, simpleEx.R, is in the current directory. The file
is given below:
# Define a variable.
x <- rnorm(10)

# calculate the mean of x and print out the results.
mux = mean(x)
cat("The mean of x is ",mean(x),"\n")

# print out a summary of the results
summary(x)
cat("The summary of x is \n",summary(x),"\n")
print(summary(x))
The file also demonstrates the use of # to specify comments. Anything after the # is ignored.
Also, the file demonstrates the use of cat and print to send results to the standard output. Note
that the commands have options to send results to a file. Use help for more information.
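For example, cat can append its output to a file rather than print it to the screen. This is only a
sketch, and the file name results.txt is made up for the example:
> cat("The mean of x is ",mean(x),"\n",file="results.txt",append=TRUE)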
The output for the different options can be found below:
> source('simpleEx.R')
The mean of x is -0.4817475
The summary of x is
-2.24 -0.5342 -0.2862 -0.4817 -0.1973 0.4259
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.2400 -0.5342 -0.2862 -0.4817 -0.1973 0.4259
>
>
>
> source('simpleEx.R',echo=TRUE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.32600 -0.69140 -0.06772 -0.13540 0.46820 1.69600
>
>
>
> source('simpleEx.R',print.eval=TRUE)
The mean of x is 0.1230581
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.7020 -0.2833 0.1174 0.1231 0.9103 1.2220
The summary of x is
-1.702 -0.2833 0.1174 0.1231 0.9103 1.222
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.7020 -0.2833 0.1174 0.1231 0.9103 1.2220
>
>
>
> source('simpleEx.R',print.eval=FALSE)
The mean of x is 0.6279428
The summary of x is
-0.7334 -0.164 0.9335 0.6279 1.23 1.604
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.7334 -0.1640 0.9335 0.6279 1.2300 1.6040
>
>
>
>
> source('simpleEx.R',verbose=TRUE)
'envir' chosen:<environment: R_GlobalEnv>
encoding = "native.enc" chosen
--> parsed 6 expressions; now eval(.)ing them:

>>>> eval(expression_nr. 1 )
=================

> # Define a variable.
> x <- rnorm(10)
curr.fun: symbol <-
.. after expression(x <- rnorm(10))

>>>> eval(expression_nr. 2 )
=================

> # calculate the mean of x and print out the results.
> mux = mean(x)
curr.fun: symbol =
.. after expression(mux = mean(x))

>>>> eval(expression_nr. 3 )
=================

> cat("The mean of x is ",mean(x),"\n")
The mean of x is -0.1090932
curr.fun: symbol cat
.. after expression(cat("The mean of x is ",mean(x),"\n"))

>>>> eval(expression_nr. 4 )
=================

> # print out a summary of the results
> summary(x)
curr.fun: symbol summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.3820 -1.0550 -0.1995 -0.1091 0.6813 2.1050
.. after expression(summary(x))

>>>> eval(expression_nr. 5 )
=================

> cat("The summary of x is \n",summary(x),"\n")
The summary of x is
-1.382 -1.055 -0.1995 -0.1091 0.6813 2.105
curr.fun: symbol cat
.. after expression(cat("The summary of x is \n",summary(x),"\n"))

>>>> eval(expression_nr. 6 )
=================

> print(summary(x))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.3820 -1.0550 -0.1995 -0.1091 0.6813 2.1050
curr.fun: symbol print
.. after expression(print(summary(x)))
One common problem that occurs is that R may not know where to find a file.
> source('notThere.R')
Error in file(filename, "r", encoding = encoding) :
cannot open the connection
In addition: Warning message:
In file(filename, "r", encoding = encoding) :
cannot open file 'notThere.R': No such file or directory
R will search the current working directory. You can see what files are in the directory using
the dir command, and you can determine the current directory using the getwd command.
> getwd()
[1] "/home/black/public_html/tutorial/R/rst/source/R"
> dir()
[1] "plotting.rData" "power.R" "shadedRegion.R"
You can change the current directory, and the options available depend on how you are using
R. For example on a Windows PC or a Macintosh you can use the menu options to change
the working directory. You can choose the directory using a graphical file browser.
Otherwise, you can change to the correct directory before running R or use the setwd
command.
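The setwd command simply takes the path of the directory you want to work in, and getwd can
be used afterwards to confirm the change. A quick sketch (the path below is made up; substitute
the directory that holds your files):
> setwd("/home/black/tutorial/R")
> getwd()
[1] "/home/black/tutorial/R"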

14.1.2. if statements
Conditional execution is available using the if statement and the corresponding else
statement.
> x = 0.1
> if( x < 0.2)
{
x <- x + 1
cat("increment that number!\n")
}
increment that number!
> x
[1] 1.1
The else statement can be used to specify an alternate option. In the example below note that
the else statement must be on the same line as the ending brace for the previous if block.
> x = 2.0
> if ( x < 0.2)
{
x <- x + 1
cat("increment that number!\n")
} else
{
x <- x - 1
cat("nah, make it smaller.\n");
}
nah, make it smaller.
> x
[1] 1
Finally, the if statements can be chained together for multiple options. The if statement is
considered a single code block, so more if statements can be added after the else.
> x = 1.0
> if ( x < 0.2)
{
x <- x + 1
cat("increment that number!\n")
} else if ( x < 2.0)
{
x <- 2.0*x
cat("not big enough!\n")
} else
{
x <- x - 1
cat("nah, make it smaller.\n");
}
not big enough!
> x
[1] 2
The argument to the if statement is a logical expression. A full list of logical operators can be
found in the types document focusing on logical variables (Logical).

14.1.3. for statements
The for loop can be used to repeat a set of instructions, and it is used when you know in
advance the values that the loop variable will have each time it goes through the loop. The
basic format for the for loop is for(var in seq) expr
An example is given below:
> for (lupe in seq(0,1,by=0.3))
{
cat(lupe,"\n");
}
0
0.3
0.6
0.9
>
> x <- c(1,2,4,8,16)
> for (loop in x)
{
cat("value of loop: ",loop,"\n");
}
value of loop: 1
value of loop: 2
value of loop: 4
value of loop: 8
value of loop: 16
See the section on breaks for more options (break and next statements)

14.1.4. while statements
The while loop can be used to repeat a set of instructions, and it is often used when you do
not know in advance how often the instructions will be executed. The basic format for a while
loop is while(cond) expr
>
> lupe <- 1;
> x <- 1
> while(x < 4)
{
x <- rnorm(1,mean=2,sd=3)
cat("trying this value: ",x," (",lupe," times in loop)\n");
lupe <- lupe + 1
}
trying this value: -4.163169 ( 1 times in loop)
trying this value: 3.061946 ( 2 times in loop)
trying this value: 2.10693 ( 3 times in loop)
trying this value: -2.06527 ( 4 times in loop)
trying this value: 0.8873237 ( 5 times in loop)
trying this value: 3.145076 ( 6 times in loop)
trying this value: 4.504809 ( 7 times in loop)
See the section on breaks for more options (break and next statements)

14.1.5. repeat statements
The repeat loop is similar to the while loop. The difference is that it will always begin the
loop the first time. The while loop will only start the loop if the condition is true the first time
it is evaluated. Another difference is that you have to explicitly specify when to stop the loop
using the break command.
That is, you need to execute the break statement to get out of the loop.
> repeat
{
x <- rnorm(1)
if(x < -2.0) break
}
> x
[1] -2.300532
See the section on breaks for more options (break and next statements)

14.1.6. break and next statements
The break statement is used to stop the execution of the current loop. It will break out of the
current loop. The next statement is used to skip the statements that follow and restart the
current loop. If a for loop is used then the next statement will update the loop variable.
> x <- rnorm(5)
> x
[1] 1.41699338 2.28086759 -0.01571884 0.56578443 0.60400784
> for(lupe in x)
{
if (lupe > 2.0)
next

if( (lupe<0.6) && (lupe > 0.5))
break

cat("The value of lupe is ",lupe,"\n");
}
The value of lupe is 1.416993
The value of lupe is -0.01571884

14.1.7. switch statement
The switch takes an expression and returns a value in a list based on the value of the
expression. How it does this depends on the data type of the expression. The basic syntax is
switch(statement,item1,item2,item3,...,itemN).
If the result of the expression is a number then it returns the item in the list with the same
index. Note that the expression is cast as an integer if it is not an integer.
> x <- as.integer(2)
> x
[1] 2
> z = switch(x,1,2,3,4,5)
> z
[1] 2
> x <- 3.5
> z = switch(x,1,2,3,4,5)
> z
[1] 3
If the result of the expression is a string, then the list of items should be in the form
valueN=resultN, and the statement will return the result that matches the value.
> y <- rnorm(5)
> y
[1] 0.4218635 -0.8205637 -1.0191267 -0.6080061 -0.6079133
> x <- "sd"
> z <-
switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y))
> z
[1] 0.5571847
> x <- "median"
> z <-
switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y))
> z
[1] -0.6080061

14.1.8. scan statement
The command to read input from the keyboard is the scan statement. It has a wide variety of
options and can be fine tuned to your specific needs. We only look at the basics here. The
scan statement waits for input from a user, and it returns the value that was typed in.
When using the command with no set number of lines the command will continue to read
keyboard input until a blank line is entered.
> help(scan)
> a <- scan(what=double(0))
1: 3.5
2:
Read 1 item
> a
[1] 3.5
> typeof(a)
[1] "double"
>
> a <- scan(what=double(0))
1: yo!
1:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
:
scan() expected 'a real', got 'yo!'
If you wish to only have it read from a fixed number of lines the nmax option can specify
how many lines can be typed in, and the multi.line option can be used to turn off multi-line
entry.
> a <- scan(what=double(0),nmax=1,multi.line = FALSE)
1: 6.7
Read 1 item
> a
[1] 6.7

14.2. Functions
A shallow overview of defining functions is given here. A few subtleties will be noted, but R
can be a little quirky with respect to defining functions. The first bit of oddness is that you
can think of a function as an object where you define the function and assign it to a variable
name.
To define a function you assign it to a name, and the keyword function is used to denote the
start of the function and its argument list.
> newDef <- function(a,b)
{
x = runif(10,a,b)
mean(x)
}
> newDef(-1,1)
[1] 0.06177728
> newDef
function(a,b)
{
x = runif(10,a,b)
mean(x)
}
The last expression in the function is what is returned. So in the example above the sample
mean of the numbers is returned.
> x <- newDef(0,1)
> x
[1] 0.4800442
The arguments that are passed are matched in order. They can be specified explicitly, though.
> newDef(b=10,a=1)
[1] 4.747509
> newDef(10,1)
[1] NaN
Warning message:
In runif(10, a, b) : NAs produced
You can mix this approach, and R will try to match up the named arguments and then match
the rest going from left to right. Another bit of weirdness is that R will not evaluate an
expression in the argument list until the moment it is needed inside the function. This lazy
evaluation is a different kind of behavior than what most people are used to, so be very careful
about it. A good rule of thumb is to avoid putting expressions with side effects in an argument
list, because they may be evaluated later than you expect, or not at all.
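A small sketch of this lazy evaluation is given below. The function name lazyDemo is made up for
the example. Because the second argument is never used inside the function, the expression
passed for it is never evaluated, and no error is raised:
> lazyDemo <- function(a,b)
{
a + 1   # the second argument, b, is never touched
}
> lazyDemo(2,stop("this is never evaluated"))
[1] 3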
Another common task is to have a function return multiple items. This can be accomplished
by returning a list of items. The objects within a list can be accessed using the same $
notation that is used for data frames.
> c = c(1,2,3,4,5)
> sample <- function(a,b)
{
value = switch(a,"median"=median(b),"mean"=mean(b),"variance"=var(b))
largeVals = length(c[c>1])
list(stat=value,number=largeVals)
}
> result <- sample("median",c)
> result
$stat
[1] 3

$number
[1] 4

> result$stat
[1] 3
> result$number
[1] 4
There is another potential problem that can occur when using a function in R. When it comes
to determining the value of a variable there is a path that R will use to search for its value. In
the case of functions if a previously undefined variable appears R will look at the argument
list for the function. Next it will look in the current work space. If you are not careful R will
find the value some place where you do not expect it, and your function will return a value
that is not correct, and no error will be given. Be very careful about the names of variables
especially when using functions.
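A minimal sketch of this behavior is given below; the names rate and growBy are made up for the
example. The function uses a variable that is not in its argument list, so R silently finds it in the
work space, and the result changes when the work space changes:
> rate <- 10
> growBy <- function(x)
{
x*rate   # rate is not an argument, so R looks for it in the work space
}
> growBy(2)
[1] 20
> rate <- 100
> growBy(2)
[1] 200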


15. Time Data Types
Contents
Time and Date Variables
Time Operations
The time data types are broken out into a separate section from the introductory section on
data types. (Basic Data Types) The reason for this is that dealing with time data can be subtle
and must be done carefully because the data type can be cast in a variety of different ways. It
is not an introductory topic, and if not done well can scare off the normal people.
I will first go over the basic time data types and then explore the different kinds of operations
that are done with the time data types. Please be cautious with time data and read the
complete description including the caveats. There are some common mistakes that
result in calculations that yield results that are very different from the intended values.

15.1. Time and Date Variables
There are a variety of different types specific to time data fields in R. Here we only look at
two, the POSIXct and POSIXlt data types:
POSIXct
The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers
represent the number of seconds before this time, and positive numbers represent the number of
seconds afterwards.
POSIXlt
The POSIXlt data type is a vector, and the entries in the vector have the following meanings:
1. seconds
2. minutes
3. hours
4. day of month (1-31)
5. month of the year (0-11)
6. years since 1900
7. day of the week (0-6 where 0 represents Sunday)
8. day of the year (0-365)
9. Daylight savings indicator (positive if it is daylight savings)
Part of the difficulty with time data types is that R prints them out in a way that is different
from how it stores them internally. This can make type conversions tricky, and you have to be
careful and test your operations to ensure that R is doing what you think it is doing.
To get the current time, the Sys.time() can be used, and you can play around a bit with the
basic types to get a feel for what R is doing. The as.POSIXct and as.POSIXlt commands are
used to convert the time value into the different formats.
> help(DateTimeClasses)
> t <- Sys.time()
> typeof(t)
[1] "double"
> t
[1] "2014-01-23 14:28:21 EST"
> print(t)
[1] "2014-01-23 14:28:21 EST"
> cat(t,"\n")
1390505301
> c <- as.POSIXct(t)
> typeof(c)
[1] "double"
> print(c)
[1] "2014-01-23 14:28:21 EST"
> cat(c,"\n")
1390505301
>
>
> l <- as.POSIXlt(t)
> l
[1] "2014-01-23 14:28:21 EST"
> typeof(l)
[1] "list"
> print(l)
[1] "2014-01-23 14:28:21 EST"
> cat(l,"\n")
Error in cat(list(...), file, sep, fill, labels, append) :
argument 1 (type 'list') cannot be handled by 'cat'
> names(l)
NULL
> l[[1]]
[1] 21.01023
> l[[2]]
[1] 28
> l[[3]]
[1] 14
> l[[4]]
[1] 23
> l[[5]]
[1] 0
> l[[6]]
[1] 114
> l[[7]]
[1] 4
> l[[8]]
[1] 22
> l[[9]]
[1] 0
>
> b <- as.POSIXct(l)
> cat(b,"\n")
1390505301
There are times when you have a time data type and want to convert it into a string so it can
be saved into a file to be read by another application. The strftime command is used to take a
time data type and convert it to a string. You must supply an additional format string to let R
know what format you want to use. See the help page on strftime to get detailed information about
the format string.
> help(strftime)
>
> t <- Sys.time()
> cat(t,"\n")
1390506463
> timeStamp <- strftime(t,"%Y-%m-%d %H:%M:%S")
> timeStamp
[1] "2014-01-23 14:47:43"
> typeof(timeStamp)
[1] "character"
Commonly a time stamp is saved in a data file, and it must be converted into a time data type
to allow for calculations. For example, you may be interested in how much time has elapsed
between two observations. The strptime command is used to take a string and convert it into a
time data type. Like strftime it requires a format string in addition to the time stamp.
The strptime command is used to take a string and convert it into a form that R can use for
calculations. In the following example a data frame is defined that has the dates stored as
strings. If you read the data in from a csv file this is how R will keep track of the data. Note
that in this context the strings are assumed to represent categorical data, and R will assume that
the data field is a set of factors. You have to use the strptime command to convert it into a
time field.
> myData <- data.frame(time=c("2014-01-23 14:28:21","2014-01-23 14:28:55",
"2014-01-23 14:29:02","2014-01-23 14:31:18"),
speed=c(2.0,2.2,3.4,5.5))
> myData
time speed
1 2014-01-23 14:28:21 2.0
2 2014-01-23 14:28:55 2.2
3 2014-01-23 14:29:02 3.4
4 2014-01-23 14:31:18 5.5
> summary(myData)
time speed
2014-01-23 14:28:21:1 Min. :2.000
2014-01-23 14:28:55:1 1st Qu.:2.150
2014-01-23 14:29:02:1 Median :2.800
2014-01-23 14:31:18:1 Mean :3.275
3rd Qu.:3.925
Max. :5.500
> myData$time[1]
[1] 2014-01-23 14:28:21
4 Levels: 2014-01-23 14:28:21 2014-01-23 14:28:55 ... 2014-01-23 14:31:18
> typeof(myData$time[1])
[1] "integer"
>
>
> myData$time <- strptime(myData$time,"%Y-%m-%d %H:%M:%S")
> myData
time speed
1 2014-01-23 14:28:21 2.0
2 2014-01-23 14:28:55 2.2
3 2014-01-23 14:29:02 3.4
4 2014-01-23 14:31:18 5.5
> myData$time[1]
[1] "2014-01-23 14:28:21"
> typeof(myData$time[1])
[1] "list"
> myData$time[1][[2]]
[1] 28
Now you can perform operations on the fields. For example you can determine the time
between observations. (Please see the notes below on time operations. This example is a bit
misleading!)
> N = length(myData$time)
> myData$time[2:N] - myData$time[1:(N-1)]
Time differences in secs
[1] 34 7 136
attr(,"tzone")
[1] ""
In addition to the time data types R also has a date data type. The difference is that the date
data type keeps track of numbers of days rather than seconds. You can cast a string into a
date type using the as.Date function. The as.Date function takes the same arguments as the
time data types discussed above.
> theDates <- c("1 jan 2012","1 jan 2013","1 jan 2014")
> dateFields <- as.Date(theDates,"%d %b %Y")
> typeof(dateFields)
[1] "double"
> dateFields
[1] "2012-01-01" "2013-01-01" "2014-01-01"
> N <- length(dateFields)
> diff <- dateFields[1:(N-1)] - dateFields[2:N]
> diff
Time differences in days
[1] -366 -365
You can also define a date in terms of the number of days after another date using the origin
option.
> infamy <- as.Date(-179,origin="1942-06-04")
> infamy
[1] "1941-12-07"
>
> today <- Sys.Date()
> today
[1] "2014-01-23"
> today-infamy
Time difference of 26345 days
Finally, a nice function to know about and use is the format command. It can be used in a
wide variety of situations, and not just for dates. It is helpful for dates, though, because you
can use it in cat and print statements to make sure that your output is in exactly the form that
you want.
> theTime <- Sys.time()
> theTime
[1] "2014-01-23 16:15:05 EST"
> a <- rexp(1,0.1)
> a
[1] 7.432072
> cat("At about",format(theTime,"%H:%M"),"the time between occurances was
around",format(a,digits=3),"seconds\n")
At about 16:15 the time between occurances was around 7.43 seconds

15.2. Time Operations
The most difficult part of dealing with time data can be converting it into the right format.
Once a time or date is stored in R's internal format then a number of basic operations are
available. The thing to keep in mind, though, is that the units you get after an operation can
vary depending on the magnitude of the time values. Be very careful when dealing with time
operations and vigorously test your codes.
> now <- Sys.time()
> now
[1] "2014-01-23 16:31:00 EST"
> now-60
[1] "2014-01-23 16:30:00 EST"
>
> earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
> later <- strptime("2000-01-01 00:00:20","%Y-%m-%d %H:%M:%S")
> later-earlier
Time difference of 20 secs
> as.double(later-earlier)
[1] 20
>
> earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
> later <- strptime("2000-01-01 01:00:00","%Y-%m-%d %H:%M:%S")
> later-earlier
Time difference of 1 hours
> as.double(later-earlier)
[1] 1
>
> up <- as.Date("1961-08-13")
> down <- as.Date("1989-11-9")
> down-up
Time difference of 10315 days
The two examples involving the variables earlier and later in the previous code sample
should cause you a little concern. The value of the difference depends on the largest units
with respect to the difference! The issue is that when you subtract dates R uses the equivalent
of the difftime command. We need to know how this operates to reduce the ambiguity when
comparing times.
> help(difftime)
>
> earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
> later <- strptime("2000-01-01 01:00:00","%Y-%m-%d %H:%M:%S")
> difftime(later,earlier)
Time difference of 1 hours
> difftime(later,earlier,units="secs")
Time difference of 3600 secs
One thing to be careful about with difftime is that it is a double precision number, but it has units
attached to it. This can be tricky, and you should be careful about the ambiguity in using this
command. I personally always try to specify the units to avoid this.
> earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
> later <- strptime("2000-01-01 00:00:20","%Y-%m-%d %H:%M:%S")
> d <- difftime(later,earlier)
> d
Time difference of 20 secs
> typeof(d)
[1] "double"
> as.double(d)
[1] 20
Another way to define a time difference is to use the as.difftime command. Rather than
subtracting two dates, it converts a time span given as a character string into a difftime object.
It takes the time span, its format, and the units to use.
Note that in the following example R is able to figure out what the units are when making the
calculation.
> diff <- as.difftime("00:30:00","%H:%M:%S",units="hour")
> diff
Time difference of 0.5 hours
> Sys.time()
[1] "2014-01-23 16:45:39 EST"
> Sys.time()+diff
[1] "2014-01-23 17:15:41 EST"
The last thing to mention is that once a time stamp is cast into one of R's internal formats
comparisons can be made in a natural way.
> diff <- as.difftime("00:30:00","%H:%M:%S",units="hour")
> now <- Sys.time()
> later <- now + diff
> now
[1] "2014-01-23 16:47:48 EST"
> later
[1] "2014-01-23 17:17:48 EST"
>
> if(now < later)
{
cat("there you go\n")
}
there you go



16. Case Study: Working Through a HW
Problem
Contents
Problem Statement
Transforming the Data
The Confidence Interval
Test of Significance
The Power of the test
We look at a sample homework problem and the R commands necessary to explore the
problem. It is assumed that you are familiar with all of the commands discussed throughout
this tutorial.

16.1. Problem Statement
This problem comes from the 5th edition of Moore and McCabe's Introduction to the Practice
of Statistics and can be found on pp. 466-467. The data consists of the emissions of three
different pollutants from 46 different engines. A copy of the data we use here is available.
The problem examined here is different from that given in the book but is motivated by the
discussion in the book.
In the following examples we will look at the carbon monoxide data which is one of the
columns of this data set. First we will transform the data so that it is close to being normally
distributed. We will then find the confidence interval for the mean and then perform a
significance test to evaluate whether or not the mean differs from a fixed standard. Finally,
we will find the power of the test to detect a fixed difference from that standard. We will
assume that a confidence level of 95% is used throughout.

16.2. Transforming the Data
We first begin a basic examination of the data. A copy of the data file can be found at
table_7_3.csv. The first step is to read in the file and get a summary of the center and spread
of the data. In this instance we will focus only on the carbon monoxide data.
> engine <- read.csv(file="table_7_3.csv",sep=",",head=TRUE)
> names(engine)
[1] "en" "hc" "co" "nox"
> summary(engine)
en hc co nox
Min. : 1.00 Min. :0.3400 Min. : 1.850 Min. :0.490
1st Qu.:12.75 1st Qu.:0.4375 1st Qu.: 4.388 1st Qu.:1.110
Median :24.50 Median :0.5100 Median : 5.905 Median :1.315
Mean :24.00 Mean :0.5502 Mean : 7.879 Mean :1.340
3rd Qu.:35.25 3rd Qu.:0.6025 3rd Qu.:10.015 3rd Qu.:1.495
Max. :46.00 Max. :1.1000 Max. :23.530 Max. :2.940
At first glance the carbon monoxide data appears to be skewed. The spread between the third
quartile and the max is five times the spread between the min and the first quartile. A boxplot
is shown in Figure 1 and indicates that the data is skewed. This is further confirmed in
the histogram which is shown in Figure 2. Finally, a normal qq plot is given in Figure 3. The
data does not appear to be normal.
> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
> boxplot(engine$co,main="Carbon Monoxide")
> hist(engine$co,main="Carbon Monoxide")
> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
>

Figure 1.
Boxplot of the Carbon Monoxide Data.

Figure 2.
Histogram of the Carbon Monoxide Data.

Figure 3.
Normal QQ Plot of the Carbon Monoxide Data.
We next see if the data can be transformed to something that is closer to being normally
distributed. We examine the logarithm of the data. First, the boxplot of the log of the data
appears to be more evenly distributed as shown in Figure 4. Also, the histogram appears to be
centered and closer to normal in Figure 5. Finally, the normal qq plot is shown in Figure 6.
It shows that the data is more consistent with what we would expect from normal data.
> lengine <- log(engine$co)
> boxplot(lengine,main="Carbon Monoxide")
> hist(lengine,main="Carbon Monoxide")
> qqnorm(lengine,main="QQ Plot for the Log of the Carbon Monoxide")
> qqline(lengine)
>

Figure 4.
Boxplot of the Logarithm of the Carbon Monoxide Data.

Figure 5.
Histogram of the Logarithm of the Carbon Monoxide Data.

Figure 6.
Normal QQ Plot of the Logarithm of the Carbon Monoxide Data.
There is strong evidence that the logarithm of the carbon monoxide data more closely
resembles a normal distribution than does the raw carbon monoxide data. For that reason all
of the analysis that follows will be for the logarithm of the data and will make use of the new
list lengine.

16.3. The Confidence Interval
We now find the confidence interval for the carbon monoxide data. As stated above, we will
work with the logarithm of the data because it appears to be closer to a normal distribution.
This data is stored in the list called lengine. Since we do not know the true standard
deviation we will use the sample standard deviation and will use a t-distribution.
We first find the sample mean, the sample standard deviation, and the number of
observations:
> m <- mean(lengine)
> s <- sd(lengine)
> n <- length(lengine)
> m
[1] 1.883678
> s
[1] 0.5983851
> n
[1] 48
Now we find the standard error:
> se <- s/sqrt(n)
> se
[1] 0.08636945
Finally, the margin of error is found based on a 95% confidence level which can then be used
to define the confidence interval:
> error <- se*qt(0.975,df=n-1)
> error
[1] 0.1737529
> left <- m-error
> right <- m + error
> left
[1] 1.709925
> right
[1] 2.057431
The 95% confidence interval is between 1.71 and 2.06. Keep in mind that this is for the
logarithm so the 95% confidence interval for the original data can be found by undoing the
logarithm:
> exp(left)
[1] 5.528548
> exp(right)
[1] 7.82584
So the 95% confidence interval for the carbon monoxide is between 5.53 and 7.83.

16.4. Test of Significance
We now perform a test of significance. Here we suppose that ideally the engines should have
a mean level of 5.4 and do a two-sided hypothesis test. Here we assume that the true mean is
labeled x and state the hypothesis test:
Ho: x = 5.4,
Ha: x ≠ 5.4.
To perform the hypothesis test we first assume that the null hypothesis is true and find the
confidence interval around the assumed mean. Fortunately, we can use the values from the
previous step:
> lNull <- log(5.4) - error
> rNull <- log(5.4) + error
> lNull
[1] 1.512646
> rNull
[1] 1.860152
> m
[1] 1.883678
The sample mean lies outside of the assumed confidence interval so we can reject the null
hypothesis. There is a low probability that we would have obtained our sample mean if the
true mean really were 5.4.
Another way to approach the problem would be to calculate the actual p-value for the sample
mean that was found. Since the sample mean is greater than the assumed mean of log(5.4), it can be found with the
following code:
> 2*(1-pt((m-log(5.4))/se,df=n-1))
[1] 0.02692539
Since the p-value is 2.7% which is less than 5% we can reject the null hypothesis.
Note that there is yet another way to do this. The function t.test will do a lot of this work for
us.
> t.test(lengine,mu = log(5.4),alternative = "two.sided")

One Sample t-test

data: lengine
t = 2.2841, df = 47, p-value = 0.02693
alternative hypothesis: true mean is not equal to 1.686399
95 percent confidence interval:
1.709925 2.057431
sample estimates:
mean of x
1.883678
More information and a more complete list of the options for this command can be found
using the help command:
> help(t.test)

16.5. The Power of the test
We now find the power of the test. To find the power we need to set a level for the mean and
then find the probability that we would accept the null hypothesis if the mean is really at the
prescribed level. Here we will find the power to detect a difference if the level were 7. Three
different methods are examined. The first is a method that some books advise to use if you do
not have a non-central t-test available. The second does make use of the non-central t-test.
Finally, the third method makes use of a customized R command.
We first find the probability of accepting the null hypothesis if the level really were 7. We
assume that the true mean is 7 and then find the probability that a sample mean would fall
within the confidence interval if the null hypothesis were true. Keep in mind that we have to
transform the level of 7 by taking its logarithm. Also keep in mind that this is a two-sided
test:
> tLeft <- (lNull-log(7))/(s/sqrt(n))
> tRight <- (rNull-log(7))/(s/sqrt(n))
> p <- pt(tRight,df=n-1) - pt(tLeft,df=n-1)
> p
[1] 0.1629119
> 1-p
[1] 0.8370881
So the probability of making a type II error is approximately 16.3%, and the probability of
detecting a difference if the level really is 7 is approximately 83.7%.
Another way to find the power is to use a non-centrality parameter. This is the method that
many books advise over the previous method. The idea is that you give it the critical t-values
associated with your test and also provide a parameter that indicates how the mean is shifted.
> t <- qt(0.975,df=n-1)
> shift <- (log(5.4)-log(7))/(s/sqrt(n))
> pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift)
[1] 0.1628579
> 1-(pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift))
[1] 0.8371421
Again, we see that the power of the test is approximately 83.7%. Note that this result is
slightly off from the previous answer. This approach is often recommended over the previous
approach.
The final approach we examine allows us to do all the calculations in one step. It makes use
of the non-centrality parameter as in the previous example, but all of the commands are done
for us.
> power.t.test(n=n,delta=log(7)-log(5.4),sd=s,sig.level=0.05,
type="one.sample",alternative="two.sided",strict = TRUE)

One-sample t test power calculation

n = 48
delta = 0.2595112
sd = 0.5983851
sig.level = 0.05
power = 0.8371421
alternative = two.sided
This is a powerful command that can do much more than just calculate the power of a test.
For example it can also be used to calculate the number of observations necessary to achieve
a given power. For more information check out the help page, help(power.t.test).
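As a sketch of that use, leaving out the n argument and supplying the desired power asks the
command to solve for the required sample size. The 90% power level below is an arbitrary
choice for the example:
> power.t.test(delta=log(7)-log(5.4),sd=s,sig.level=0.05,power=0.90,
type="one.sample",alternative="two.sided",strict=TRUE)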










17. Case Study II: A JAMA Paper on
Cholesterol
Contents
Overview of the Paper
The Tables
Confirming the p-values in Table 3
Confirming the p-values in Table 4
Finding the Power of the Test in Table 3
Differences by Race in Table 2
Summary
We look at a paper that appeared in the Journal of the American Medical Association and
explore how to use R to confirm the results. It is assumed that you are familiar with all of the
commands discussed throughout this tutorial.

17.1. Overview of the Paper
The paper we examine is by Carroll et al. [Carroll2005] The goal is to confirm the results and
explore some of the other results not explicitly addressed in the paper. This paper received a
great deal of attention in the media. A partial list of some of the articles is given below, but
many of them are now defunct:
FOX News
www.medpagetoday.com
Argus Leader
The Globe and Mail
The authors examine the trends of several studies of cholesterol levels of Americans. The
studies were conducted in 1960-1962, 1971-1973, 1976-1980, 1988-1994, and 1999-2002.
Analyses of the studies prior to 1999 have indicated that overall cholesterol levels are
declining. The authors of this paper focus on the changes between the two latest studies,
1988-1994 and 1999-2002. They concluded that between certain populations cholesterol
levels have decreased over this time.
One of the things that received a great deal of attention is the linkage the authors drew
between lowered cholesterol levels and increased use of new drugs to lower cholesterol. Here
is a quote from their conclusions:
The increase in the proportion of adults using lipid-lowering medication, particularly in older age
groups, likely contributed to the decreases in total and LDL cholesterol levels observed.
Here we focus on confirming the results listed in Tables 3 and 4 of the paper. We confirm
the p-values given in the paper and then calculate the power of the test to detect a prescribed
difference in cholesterol levels.

17.2. The Tables
Links to the tables in the paper are given below. Links are given to verbatim copies of the
tables. For each table there are two links. The first is to a text file displaying the table. The
second is to a csv file to be loaded into R. It is assumed that you have downloaded each of the
csv files and made them available.
Links to the Tables in the paper:
Table 1 text 1 csv 1
Table 2 text 2 csv 2
Table 3 text 3 csv 3
Table 4 text 4 csv 4
Table 5 text 5 csv 5
Table 6 text 6 csv 6

17.3. Confirming the p-values in Table 3
The first thing we do is confirm the p-values. The paper does not explicitly state the
hypothesis test, but they use a two-sided test as we shall soon see. We will explicitly define
the hypothesis test that the authors are using but first need to define some terms. We need the
means for the 1988-1994 and the 1999-2002 studies and will denote them M88 and M99
respectively. We also need the standard errors and will denote them SE88 and SE99
respectively.
In this situation we are trying to compare the means of two experiments and do not have
matched pairs. With this in mind we can define our hypothesis test:
H0: M88 - M99 = 0,
Ha: M88 - M99 ≠ 0.
When we assume that the null hypothesis is true we calculate the p-values using the following
values:
Sample Mean = M88 - M99,
SE = sqrt(SE88^2 + SE99^2).
Note that the standard errors are given in the data, and we do not have to use the number of
observations to calculate the standard error. However, we do need the number of observations
in calculating the p-value. The authors used a t test. There are complicated formulas used to
calculate the degrees of freedom for the comparison of two means, but here we will simply
find the minimum of the set of observations and subtract one.
We first need to read in the data from table3.csv and will call the new variable t3. Note that
we use a new option, row.names="group". This option tells R to use the entries in the
"group" column as the row names. Once the table has been read we will need to make use of
the means in the 1988-1994 study (t3$M.88) and the sample means in the 1999-2002 study
(t3$M.99). We will also have to make use of the corresponding standard errors (t3$SE.88 and
t3$SE.99) and the number of observations (t3$N.88 and t3$N.99).
> t3 <- read.csv(file="table3.csv",header=TRUE,sep=",",row.names="group")
> row.names(t3)
[1] "all" "g20" "men" "mg20" "m20-29" "m30-39" "m40-49" "m50-
59"
[9] "m60-74" "m75" "women" "wg20" "w20-29" "w30-39" "w40-49" "w50-
59"
[17] "w60-74" "w75"
> names(t3)
[1] "N.60" "M.60" "SE.60" "N.71" "M.71" "SE.71" "N.76" "M.76"
"SE.76"
[10] "N.88" "M.88" "SE.88" "N.99" "M.99" "SE.99" "p"
> t3$M.88
[1] 204 206 204 204 180 201 211 216 214 205 205 207 183 189 204 228 235
231
> t3$M.99
[1] 203 203 203 202 183 200 212 215 204 195 202 204 183 194 203 216 223
217
> diff <- t3$M.88-t3$M.99
> diff
[1] 1 3 1 2 -3 1 -1 1 10 10 3 3 0 -5 1 12 12 14
> se <- sqrt(t3$SE.88^2+t3$SE.99^2)
> se
[1] 1.140175 1.063015 1.500000 1.500000 2.195450 2.193171 3.361547
3.041381
[9] 2.193171 3.328663 1.131371 1.063015 2.140093 1.984943 2.126029
2.483948
[17] 2.126029 2.860070
> deg <- pmin(t3$N.88,t3$N.99)-1
> deg
[1] 7739 8808 3648 4164 673 672 759 570 970 515 4090 4643 960 860
753
[16] 568 945 552
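As an aside, a more careful approximation to the degrees of freedom is the Welch-Satterthwaite
formula, which can be built from the same columns. This is only a sketch (the names v88, v99,
and degWelch are made up), and it is not necessarily what the authors of the paper used:
> v88 <- t3$SE.88^2
> v99 <- t3$SE.99^2
> degWelch <- (v88+v99)^2/(v88^2/(t3$N.88-1) + v99^2/(t3$N.99-1))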
We can now calculate the t statistic. From the null hypothesis, the assumed mean of the
difference is zero. We can then use the pt command to get the p-values.
> t <- diff/se
> t
[1] 0.8770580 2.8221626 0.6666667 1.3333333 -1.3664626 0.4559608
[7] -0.2974821 0.3287980 4.5596075 3.0042088 2.6516504 2.8221626
[13] 0.0000000 -2.5189636 0.4703604 4.8310181 5.6443252 4.8949852
> pt(t,df=deg)
[1] 0.809758825 0.997609607 0.747486382 0.908752313 0.086125089
0.675717245
[7] 0.383089952 0.628785421 0.999997110 0.998603837 0.995979577
0.997604809
[13] 0.500000000 0.005975203 0.680883135 0.999999125 0.999999989
0.999999354
There are two problems with the calculation above. First, some of the t-values are positive,
and for positive values we need the area under the curve to the right. There are a couple of
ways to fix this, and here we will insure that the t scores are negative by taking the negative
of the absolute value. The second problem is that this is a two-sided test, and we have to
multiply the probability by two:
> pt(-abs(t),df=deg)
[1] 1.902412e-01 2.390393e-03 2.525136e-01 9.124769e-02 8.612509e-02
[6] 3.242828e-01 3.830900e-01 3.712146e-01 2.889894e-06 1.396163e-03
[11] 4.020423e-03 2.395191e-03 5.000000e-01 5.975203e-03 3.191169e-01
[16] 8.748656e-07 1.095966e-08 6.462814e-07
> 2*pt(-abs(t),df=deg)
[1] 3.804823e-01 4.780786e-03 5.050272e-01 1.824954e-01 1.722502e-01
[6] 6.485655e-01 7.661799e-01 7.424292e-01 5.779788e-06 2.792326e-03
[11] 8.040845e-03 4.790382e-03 1.000000e+00 1.195041e-02 6.382337e-01
[16] 1.749731e-06 2.191933e-08 1.292563e-06
These numbers are a close match to the values given in the paper, but the output above is
hard to read. We introduce a new command to loop through and print out the results in a
format that is easier to read. The for loop allows you to repeat a command a specified number
of times. Here we want to go from 1, 2, 3, ..., to the end of the list of p-values and print out
the group and associated p-value:
> p <- 2*pt(-abs(t),df=deg)
> for (j in 1:length(p)) {
cat("p-value for ",row.names(t3)[j]," ",p[j],"\n");
}
p-value for all 0.3804823
p-value for g20 0.004780786
p-value for men 0.5050272
p-value for mg20 0.1824954
p-value for m20-29 0.1722502
p-value for m30-39 0.6485655
p-value for m40-49 0.7661799
p-value for m50-59 0.7424292
p-value for m60-74 5.779788e-06
p-value for m75 0.002792326
p-value for women 0.008040845
p-value for wg20 0.004790382
p-value for w20-29 1
p-value for w30-39 0.01195041
p-value for w40-49 0.6382337
p-value for w50-59 1.749731e-06
p-value for w60-74 2.191933e-08
p-value for w75 1.292563e-06
We can now compare this to Table 3 and see that we have good agreement. The differences
come from round-off errors from using the truncated data in the article as well as using a
different method to calculate the degrees of freedom. Note that for p-values close to zero the
percent errors are very large.
It is interesting to note that among the categories (rows) given in the table, only a small
number of the differences have a p-value small enough to reject the null hypothesis at the
95% level. The differences with a p-value less than 5% are the group of all people, men from
60 to 74, men greater than 74, women greater than 20, all women, and women from the age
groups of 30-39, 50-59, 60-74, and greater than 74.
The p-values for nine out of the eighteen categories are low enough to allow us to reject the
associated null hypothesis. One of those categories is for all people in the study, but very few
of the male categories have significant differences at the 95% level. The majority of the
differences are in the female categories especially the older age brackets.

17.4. Confirming the p-values in Table 4
We now confirm the p-values given in Table 4. The level of detail in the previous section is
not repeated here; rather, the commands are briefly given below:
> t4 <- read.csv(file="table4.csv",header=TRUE,sep=",",row.names="group")
> names(t4)
[1] "S88N" "S88M" "S88SE" "S99N" "S99M" "S99SE" "p"
> diff <- t4$S88M - t4$S99M
> se <- sqrt(t4$S88SE^2+t4$S99SE^2)
> deg <- pmin(t4$S88N,t4$S99N)-1
> t <- diff/se
> p <- 2*pt(-abs(t),df=deg)
> for (j in 1:length(p)) {
cat("p-values for ",row.names(t4)[j]," ",p[j],"\n");
}
p-values for MA 0.07724362
p-values for MAM 0.6592499
p-values for MAW 0.002497728
p-values for NHW 0.1184228
p-values for NHWM 0.2673851
p-values for NHWW 0.02585374
p-values for NHB 0.001963195
p-values for NHBM 0.003442551
p-values for NHBW 0.007932079
Again, the p-values are close to those given in Table 4. The numbers are off due to truncation
errors from the true data as well as a simplified calculation of the degrees of freedom. As in
the previous section the p-values that are close to zero have the greatest percent errors.

17.5. Finding the Power of the Test in Table 3
We now will find the power of the test to detect a difference. Here we arbitrarily choose to
find the power to detect a difference of 4 points. The first step is to assume that the null
hypothesis is true and find the 95% confidence
interval around a difference of zero:
> t3 <- read.csv(file="table3.csv",header=TRUE,sep=",",row.names="group")
> se <- sqrt(t3$SE.88^2+t3$SE.99^2)
> deg <- pmin(t3$N.88,t3$N.99)-1
> tcut <- qt(0.975,df=deg)
> tcut
[1] 1.960271 1.960233 1.960614 1.960534 1.963495 1.963500 1.963094
1.964135
[9] 1.962413 1.964581 1.960544 1.960475 1.962438 1.962726 1.963119
1.964149
[17] 1.962477 1.964271
Now that the cutoff t-scores for the 95% confidence interval have been established we want
to find the probability of making a type II error. We find the probability of making a type II
error if the difference is a positive 4.
> typeII <- pt(tcut,df=deg,ncp=4/se)-pt(-tcut,df=deg,ncp=4/se)
> typeII
[1] 0.06083127 0.03573266 0.24009202 0.24006497 0.55583392 0.55508927
[7] 0.77898598 0.74064782 0.55477307 0.77573784 0.05765826 0.03576160
[13] 0.53688674 0.47884787 0.53218625 0.63753092 0.53199248 0.71316969
> 1-typeII
[1] 0.9391687 0.9642673 0.7599080 0.7599350 0.4441661 0.4449107 0.2210140
[8] 0.2593522 0.4452269 0.2242622 0.9423417 0.9642384 0.4631133 0.5211521
[15] 0.4678138 0.3624691 0.4680075 0.2868303
It looks like there is a mix here. Some of the tests have a very high power while others are
poor. Six of the categories have very high power and four have powers less than 30%. One
problem is that this is hard to read. We now show how to use the for loop to create a nicer
output:
> power <- 1-typeII
> for (j in 1:length(power)) {
cat("power for ",row.names(t3)[j]," ",power[j],"\n");
}

power for all 0.9391687
power for g20 0.9642673
power for men 0.759908
power for mg20 0.759935
power for m20-29 0.4441661
power for m30-39 0.4449107
power for m40-49 0.221014
power for m50-59 0.2593522
power for m60-74 0.4452269
power for m75 0.2242622
power for women 0.9423417
power for wg20 0.9642384
power for w20-29 0.4631133
power for w30-39 0.5211521
power for w40-49 0.4678138
power for w50-59 0.3624691
power for w60-74 0.4680075
power for w75 0.2868303

We see that the composite groups, groups made up of larger age groups, have much higher
powers than the age stratified groups. It also appears that the groups composed of women
seem to have higher powers as well.

17.6. Differences by Race in Table 2
We now look at the differences in the rows of LDL cholesterol in Table 2. Table 2 lists the
cholesterol levels broken down by race. Here the p-values are calculated for the means for the
totals rather than for every entry in the table. We will use the same two-sided hypothesis test
as above, the null hypothesis is that there is no difference between the means.
Since we are comparing means from a subset of the entries and across rows we cannot simply
copy the commands from above. We will first read the data from the files and then separate
the first, fourth, seventh, and tenth rows:
> t1 <- read.csv(file='table1.csv',head=TRUE,sep=',',row.name="Group")
> t1
Total HDL LDL STG
AllG20 8809 8808 3867 3982
MG20 4165 4164 1815 1893
WG20 4644 4644 2052 2089
AMA 2122 2122 950 994
AMA-M 998 998 439 467
AMA-F 1124 1124 511 527
ANHW 4338 4337 1938 1997
ANHW-M 2091 2090 924 965
ANHW-F 2247 2247 1014 1032
ANHB 1602 1602 670 674
ANHB-M 749 749 309 312
ANHB-W 853 853 361 362
M20-29 674 674 304 311
M30-39 673 673 316 323
M40-49 760 760 318 342
M50-59 571 571 245 262
M60-69 671 670 287 301
M70 816 816 345 354
W20-29 961 961 415 419
W30-39 861 861 374 377
W40-49 754 755 347 352
W50-59 569 569 256 263
W60-69 672 671 315 324
W70 827 827 345 354
> t2 <- read.csv(file='table2.csv',head=T,sep=',',row.name="group")
> ldlM <- c(t2$LDL[1],t2$LDL[4],t2$LDL[7],t2$LDL[10])
> ldlSE <- c(t2$LDLSE[1],t2$LDLSE[4],t2$LDLSE[7],t2$LDLSE[10])
> ldlN <- c(t1$LDL[1],t1$LDL[4],t1$LDL[7],t1$LDL[10])
> ldlNames <-
c(row.names(t1)[1],row.names(t1)[4],row.names(t1)[7],row.names(t1)[10])
> ldlM
[1] 123 121 124 121
> ldlSE
[1] 1.0 1.3 1.2 1.6
> ldlN
[1] 3867 950 1938 670
> ldlNames
[1] "AllG20" "AMA" "ANHW" "ANHB"
We can now find the approximate p-values. This is not the same as the previous examples
because the means are not being compared across matching values of different lists but down
the rows of a single list. We will make use of two for loops. The idea is that we will loop
though each row except the last row. Then for each of these rows we make a comparison for
every row beneath:
> for (j in 1:(length(ldlM)-1)) {
for (k in (j+1):length(ldlM)) {
diff <- ldlM[j]-ldlM[k]
se <- sqrt(ldlSE[j]^2+ldlSE[k]^2)
t <- diff/se
n <- min(ldlN[j],ldlN[k])-1
p <- 2*pt(-abs(t),df=n)
cat("(",j,",",k,") - ",ldlNames[j]," vs ",ldlNames[k]," p: ",p,"\n")
}
}
( 1 , 2 ) - AllG20 vs AMA p: 0.2229872
( 1 , 3 ) - AllG20 vs ANHW p: 0.5221284
( 1 , 4 ) - AllG20 vs ANHB p: 0.2895281
( 2 , 3 ) - AMA vs ANHW p: 0.09027066
( 2 , 4 ) - AMA vs ANHB p: 1
( 3 , 4 ) - ANHW vs ANHB p: 0.1340862
We cannot reject the null hypothesis at the 95% level for any of the differences. We cannot
say that there is a significant difference between any of the groups in this comparison.

17.7. Summary
We examined some of the results stated in the paper Trends in Serum Lipids and
Lipoproteins of Adults, 1960-2002 and confirmed the stated p-values for Tables 3 and 4. We
also found the p-values for the differences between some of the race categories in Table 2.
When checking the p-values for Table 3 we found that nine of the eighteen differences are
significant. Looking at the boxplot of the means from Table 3 (below), we see that two of those
means (the means for the two oldest categories of women) are outliers.
> boxplot(t3$M.60,t3$M.71,t3$M.76,t3$M.88,t3$M.99,
names=c("60-62","71-73","76-80","88-94","99-02"),
xlab="Studies",ylab="Serum Total Cholesterol",
main="Comparison of Serum Total Cholesterol By Study")

Figure 1.
Boxplot of the results in Table 3.
The boxplot helps demonstrate what was found in the comparisons of the earlier studies.
There has been a long-term decline in cholesterol levels over the whole span of these studies.
The new wrinkle in this study is the association between the use of statins and lower blood
serum cholesterol levels. The results given in Table 6 show that every category of people has
shown a significant increase in the use of statins.
One question for discussion is whether or not the association provides strong enough evidence
to claim a causative link between the two. The news articles published in response to the paper
imply that the use of statins caused the lower cholesterol levels. However, the study only
demonstrates a positive association, and the decline in cholesterol levels may be a long-term
phenomenon arising from interactions among a wide range of factors.
Bibliography
[Carroll2005]
Margaret D. Carroll, MSPH; David A. Lacher, MD; Paul D. Sorlie, PhD; James I. Cleeman, MD;
David J. Gordon, MD, PhD; Michael Wolz, MS; Scott M. Grundy, MD, PhD; Clifford L. Johnson,
MSPH. Trends in Serum Lipids and Lipoproteins of Adults, 1960-2002. Journal of the American
Medical Association, October 12, 2005, Vol. 294, No. 14, pp. 1773-1781.
