R Programming Course Notes
Xing Su
Contents
Overview and History of R
Coding Standards
Arrays
Factors
Missing Values
Sequence of Numbers
Subsetting
  Vectors
  Lists
  Matrices
  Partial Matching
Logic
Understanding Data
Split-Apply-Combine Functions
  split()
  apply()
  lapply()
  sapply()
  vapply()
  tapply()
  mapply()
  aggregate()
Simulation
  Simulation Examples
Base Graphics
Larger Tables
Control Structures
  if - else
  for
  while
Functions
Scoping
  Scoping Example
Optimization
Debugging
R Profiler
Miscellaneous
History of S
Bell Labs > Insightful > Lucent > Alcatel-Lucent
in 1998, S won the Association for Computing Machinery's Software System Award
History of R
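1991 = created in New Zealand by Ross Ihaka and Robert Gentleman
1993 = first announcement of R to the public
1995 = Martin Mächler convinces Ihaka and Gentleman to release R under the GNU General Public License
1996 = public mailing lists created (R-help and R-devel)
1997 = R Core Group formed
2000 = R version 1.0.0 released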
R Features
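R is free software, which gives users four freedoms:
freedom to run the program for any purpose
freedom to study how the program works and adapt it (requires access to the source code)
freedom to redistribute copies
freedom to improve the program and release the improvements to the public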
R Drawbacks
40 year-old technology
little built-in support for dynamic/3D graphics
functionality based on consumer demand
objects generally stored in physical memory (limited by hardware)
Coding Standards
[1] at the beginning of the output = which element of the vector is being shown
basic (atomic) classes of objects: character, numeric, integer, complex, logical
Numbers
numbers generally treated as numeric objects (double precision real numbers - decimals)
Integer objects can be created by adding L to the end of a number (ex. 1L)
Inf = infinity, can be used in calculations
NaN = not a number/undefined
sqrt(value) = square root of value
Variables
variable <- value = assignment of a value to a variable name
Vectors and Lists
atomic vector = contains one data type, most basic object
vector <- c(value1, value2, ...) = creates a vector with specified values
vector1*vector2 = element by element multiplication (rather than matrix multiplication)
if the vectors are of different lengths, the shorter vector is recycled until it matches the length of the longer one
computations on vectors/between vectors (+, -, ==, /, etc.) are done element by element by default
%*% = force matrix multiplication between vectors/matrices
vector("class", n) = creates empty vector of length n and specified class
vector("numeric", 3) = creates 0 0 0
c() = concatenate
T, F = shorthand for TRUE and FALSE
1+0i = complex numbers
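The vector operations above can be sketched as follows. This is a minimal illustration; the variable names (a, b, short) are made up for the example.

```r
# Element-wise arithmetic versus matrix multiplication.
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

elementwise <- a * b      # multiplies matching positions
inner <- a %*% b          # 1 x 1 matrix holding the inner product

# Recycling: the shorter vector is reused until the longer one is covered.
short <- c(1, 2)
recycled <- a + short     # same as a + c(1, 2, 1, 2)

# vector() creates a zero-filled vector of the requested class and length.
zeros <- vector("numeric", 3)

print(elementwise)
print(inner)
print(recycled)
print(zeros)
```

Note that %*% returns a matrix even when the result is a single number.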
explicit coercion
as.numeric(x), as.logical(x), as.character(x), as.complex(x) = convert object from one
class to another
nonsensical coercion results in NA, with a warning (ex. as.numeric(c("a", "b")))
as.list(data.frame) = converts a data.frame object into a list object
as.character(list) = converts list into a character vector
implicit coercion
matrix/vector can only contain one data type, so when attempting to create a matrix/vector from values of different classes, R coerces every element to the same class
least common denominator is the approach used (everything is converted to a class that all values can take: logical > integer > numeric > complex > character) and no errors are generated
x <- c(NA, 2, "D") will create a vector of character class
list() = special vector with different classes of elements
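A short sketch of implicit versus explicit coercion, using made-up values; the variable names are illustrative only.

```r
# Implicit coercion: c() converts everything to the most flexible class present.
mixed_num_chr <- c(1.5, "a")     # character
mixed_lgl_num <- c(TRUE, 2)      # numeric (TRUE becomes 1)
mixed_lgl_chr <- c(TRUE, "a")    # character

# Explicit coercion: values that cannot be converted become NA (with a warning,
# silenced here so the example runs quietly).
converted <- suppressWarnings(as.numeric(c("1", "2", "b")))

# A list keeps each element's own class.
kept <- list(1L, "a", TRUE)

print(class(mixed_num_chr))
print(mixed_lgl_num)
print(class(mixed_lgl_chr))
print(converted)
print(sapply(kept, class))
```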
x <- c(NA, 1, "cx", NA, 2, "dsa")
x
## [1] NA    "1"   "cx"  NA    "2"   "dsa"
# convert to matrix
dim(x) <- c(3, 2)
class(x)
## [1] "matrix"
# Note: R 4.0.0 and later print "matrix" "array" here
x
##      [,1] [,2]
## [1,] NA   NA
## [2,] "1"  "2"
## [3,] "cx" "dsa"
every element of the list must correspond in length to the dimensions of the array
dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i"))
set the names for row, column, and third dimension respectively (2 x 2 x 5 in this case)
dim() function can be used to create arrays from vectors or matrices
x <- rnorm(20); dim(x) <- c(2, 2, 5) = converts a 20 element vector to a 2x2x5 array
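As a small sketch of the array notes above: the example below uses 1:20 instead of rnorm(20), an assumption made only so that the positions of the values are easy to follow.

```r
# Build a 2 x 2 x 5 array by assigning dimensions to a 20-element vector.
x <- 1:20
dim(x) <- c(2, 2, 5)

# One name vector per dimension; each length must match the matching entry of dim(x).
dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i"))

print(dim(x))
print(x["a", "c", "e"])   # first element (arrays are filled column-major)
print(x["b", "d", "i"])   # last element
```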
Factors
Factors are used to represent categorical data (integer vector where each value has a label)
2 types: unordered vs ordered
treated specially by lm(), glm()
Factors easier to understand because they self describe (vs. 1 and 2)
factor(c("a", "b"), levels = c("a", "b")) = creates factor (values not listed in levels become NA)
levels argument can be used to specify the baseline level (the first level) vs other levels
Note: without explicit specification, R orders the levels alphabetically
table(factorVar) = how many of each are in the factor
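A minimal sketch of the factor behavior described above, with invented data.

```r
# Explicit levels: the first level ("yes") becomes the baseline level
# used by modelling functions such as lm().
answers <- factor(c("yes", "no", "yes", "yes"), levels = c("yes", "no"))

# Without `levels`, R sorts the levels alphabetically.
default_order <- factor(c("yes", "no", "yes", "yes"))

# Ordered factor: the levels carry a meaningful order and can be compared.
sizes <- factor(c("low", "high", "low"), levels = c("low", "high"), ordered = TRUE)

print(levels(answers))
print(levels(default_order))
print(table(answers))
print(unclass(answers))     # underlying integer codes plus the levels attribute
print(sizes[1] < sizes[2])
```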
Missing Values
NaN or NA = missing values
NaN = undefined mathematical operations
NA = any value not available or missing in the statistical sense
any operation with NA results in NA
NA can have different classes (integer, character, etc.)
Note: NaN is an NA value, but NA is not NaN
is.na(), is.nan() = test whether each element of the vector is NA or NaN, respectively
Note: cannot compare with NA using == (the result is always NA), as NA is not a value but a placeholder for a quantity that is not available
sum(my_na), where my_na <- is.na(x) = sum of a logical vector (TRUE = 1 and FALSE = 0) is effectively the number of TRUEs, here the number of missing values
Removing NA Values
is.na() = creates logical vector where TRUE marks an NA and FALSE marks an existing value
subsetting with the negation of the above result, x[!is.na(x)], returns only the non-NA elements
complete.cases(obj1, obj2) = creates logical vector where TRUE is where both values exist,
and FALSE is where any is NA
can be used on data frames as well
complete.cases(data.frame) = creates logical vector indicating which observations/rows are complete
data.frame[logicalVector, ] = returns all observations with complete data
Imputing Missing Values = replacing missing values with estimates (can be averages from all other
data with the similar conditions)
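The missing-value tools above can be sketched as follows. The data are invented, and the mean imputation at the end is only one simple choice of estimate.

```r
x <- c(1, NA, 3, NaN, 5)

na_flags  <- is.na(x)      # TRUE for both NA and NaN
nan_flags <- is.nan(x)     # TRUE only for NaN
n_missing <- sum(na_flags) # counts the TRUEs
clean     <- x[!is.na(x)]  # keep only the non-missing values

# complete.cases() on a data frame flags rows with no missing values.
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
good <- complete.cases(df)
complete_rows <- df[good, ]

# Simple imputation: replace missing values with the mean of the observed ones.
imputed <- x
imputed[is.na(imputed)] <- mean(x, na.rm = TRUE)

print(n_missing)
print(clean)
print(complete_rows)
print(imputed)
```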
Sequence of Numbers
1:20 = creates a sequence of numbers from first number to second number
works in descending order as well
increment = 1
?`:` = enclose operators in backticks to get their help page
seq(1, 20, by = 0.5) = sequence 1 to 20 by increment of .5
length.out = 30 argument (may be abbreviated as length = 30) can be used to specify the number of values generated
length(variable) = length of vector/sequence
seq_along(vector) or seq(along.with = vector) = creates the sequence 1, 2, ..., length(vector)
rep(0, times = 40) = creates a vector with 40 zeroes
rep(c(1, 2), times = 10) = repeats combination of numbers 10 times
rep(c(1, 2), each = 10) = repeats first value 10 times followed by second value 10 times
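A compact sketch of the sequence functions above; all values are arbitrary.

```r
up     <- 1:5
down   <- 5:1                          # descending also works
halves <- seq(1, 3, by = 0.5)          # fixed increment
spaced <- seq(0, 1, length.out = 5)    # fixed number of values

letters3 <- c("a", "b", "c")
idx <- seq_along(letters3)             # 1 2 3, one index per element

rep_times <- rep(c(1, 2), times = 2)   # whole pattern repeated
rep_each  <- rep(c(1, 2), each = 2)    # each value repeated in turn

print(halves)
print(spaced)
print(idx)
print(rep_times)
print(rep_each)
```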
Subsetting
R uses one-based indexing > starts counting at 1
x[0] returns a zero-length vector (numeric(0) for a numeric vector), not an error
x[3000] returns NA when x has fewer than 3000 elements (not out of bounds/error)
[] = always returns object of same class, can select more than one element of an object (ex. x[1:2])
[[]] = can extract one element from list or data frame, returned object not necessarily list/dataframe
$ = can extract elements from list/dataframe that have names associated with them, returned object not necessarily same class
Vectors
x[1:10] = first 10 elements of vector x
x[is.na(x)] = returns all NA elements
x[!is.na(x)] = returns all non NA elements
x > 0 = returns logical vector comparing all elements to 0 (TRUE/FALSE for all values except NA, and NA for NA elements, since NA is a placeholder)
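A minimal sketch of vector subsetting with a missing value present; the data are invented.

```r
x <- c(5, NA, -2, 8)

first_two      <- x[1:2]
missing_only   <- x[is.na(x)]
present        <- x[!is.na(x)]
positive_flags <- x > 0                 # the NA element stays NA
positive       <- x[!is.na(x) & x > 0]  # drop NA before filtering
empty          <- x[0]                  # zero-length vector, no error
beyond         <- x[10]                 # NA, no out-of-bounds error

print(positive_flags)
print(positive)
print(empty)
print(beyond)
```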
Lists
x <- list(foo = 1:4, bar = 0.6)
x[1] or x["foo"] = returns a list containing the element foo
x[[2]] or x[["bar"]] or x$bar = returns the content of the second element of the list (in this case a vector without the name attribute)
x[c(1, 2)] = extracts multiple elements of the list (must use [], not [[]])
Note: $ can't extract multiple elements
x[[name]] = extracts using the value of the variable name, whereas $ must match the literal name of the element
x[[c(1, 3)]] or x[[1]][[3]] = extracts a nested element: the third element of the first object in the list
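The list operators can be compared side by side in a short sketch. The third element baz is added here purely to illustrate nested extraction and is not part of the list defined in these notes.

```r
x <- list(foo = 1:4, bar = 0.6, baz = list(10, 20, 30))

sub_list <- x["foo"]      # still a list (of length 1)
content  <- x[["bar"]]    # the element itself
same     <- x$bar         # same as x[["bar"]]
several  <- x[c(1, 2)]    # multiple elements need single brackets

name <- "foo"
by_variable <- x[[name]]  # uses the value stored in `name`
by_dollar   <- x$name     # looks for an element literally called "name"

nested  <- x[[c(3, 2)]]   # second element of the third element
nested2 <- x[[3]][[2]]    # equivalent, step by step

print(class(sub_list))
print(content)
print(by_variable)
print(is.null(by_dollar))
print(nested)
```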
Matrices
x[1, 2] = extract the element at (row, column)
x[, 2] or x[1, ] = extract an entire column/row
x[, 11:17] = subset x (matrix or data.frame) with all rows, but only columns 11 to 17
when a single element, row, or column of a matrix is retrieved, a vector is returned by default (dimensions are dropped)
behavior can be turned off (force return a matrix) by adding drop = FALSE
x[1, 2, drop = FALSE]
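A small sketch of the effect of drop = FALSE on a made-up 2 x 3 matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column

single   <- m[1, 2]                # plain value
row_vec  <- m[1, ]                 # dimensions dropped: a vector
col_vec  <- m[, 2]
row_mat  <- m[1, , drop = FALSE]   # stays a 1 x 3 matrix
cell_mat <- m[1, 2, drop = FALSE]  # stays a 1 x 1 matrix

print(single)
print(row_vec)
print(dim(row_mat))
print(dim(cell_mat))
```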
Partial Matching
works with [[]] and $
$ automatically partial matches the name (x$a)
[[]] can partial match by adding exact = FALSE
x[["a", exact = FALSE]]
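A minimal sketch of partial matching; the list and its element name are invented for the example.

```r
x <- list(aardvark = 1:5)

dollar  <- x$a                      # $ partially matches "aardvark"
strict  <- x[["a"]]                 # [[ ]] matches exactly by default: NULL
relaxed <- x[["a", exact = FALSE]]  # partial matching switched on

print(dollar)
print(is.null(strict))
print(relaxed)
```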
Logic
Understanding Data
use class(), dim(), nrow(), ncol(), names() to understand dataset
object.size(data.frame) = returns how much space the dataset is occupying in memory
head(data.frame, 10), tail(data.frame, 10) = returns first/last 10 rows of data; default = 6
summary() = provides different output for each variable, depending on class
for numerical variables, displays min, max, mean, median, etc.
for categorical (factor) variables, displays number of times each value occurs
table(data.frame$variable) = table of all values of the variable, and how many observations there
are for each
Note: mean for variables that only have values 1 and 0 = proportion of success
str(data.frame) = structure of data, provides data class, number of observations and variables, and the name and class of each variable with a preview of its contents
compactly displays the internal structure of an R object
answers the question "What's in this object?"
well-suited to compactly display the contents of lists
View(data.frame) = opens a viewer showing the contents of the data frame
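The inspection functions above can be tried on any data frame. The sketch below assumes the built-in mtcars dataset (from the datasets package), chosen only because it ships with R.

```r
data(mtcars)

cls  <- class(mtcars)
dims <- dim(mtcars)               # rows, columns
top  <- head(mtcars, 3)           # first 3 rows instead of the default 6

str(mtcars)                       # class, size, and a preview of each column
print(summary(mtcars$mpg))        # numeric summary of one variable

cyl_counts  <- table(mtcars$cyl)  # observations per distinct value
prop_manual <- mean(mtcars$am)    # mean of a 0/1 variable = proportion of 1s
print(object.size(mtcars))        # memory used by the object

print(dims)
print(cyl_counts)
print(prop_manual)
```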
Split-Apply-Combine Functions
loop functions = convenient ways of implementing the Split-Apply-Combine strategy for data analysis
split()
takes a vector/object and splits it into groups by a factor or list of factors
split(x, f, drop = FALSE)
x = vector/list/data frame
f = factor/list of factors
drop = whether empty factor levels should be dropped
interaction(gl(2, 5), gl(5, 2)) = 1.1, 1.2, . . . 2.5
gl(n, m) = generate factor levels function
n = number of levels
m = number of repetitions
split function can do this by passing in list(f1, f2) in argument
split(data, list(gl(2, 5), gl(5, 2))) = splits the data into 1.1, 1.2, . . . 2.5 levels
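A short sketch of split() with one factor and with two factors, using 1:10 as stand-in data.

```r
x  <- 1:10
f1 <- gl(2, 5)    # 1 1 1 1 1 2 2 2 2 2
f2 <- gl(5, 2)    # 1 1 2 2 3 3 4 4 5 5

by_one <- split(x, f1)                # two groups

combos <- interaction(f1, f2)         # combined factor with 2 * 5 = 10 levels
by_two <- split(x, list(f1, f2))      # one group per combination, some empty
by_two_dropped <- split(x, list(f1, f2), drop = TRUE)  # empty groups removed

print(by_one)
print(nlevels(combos))
print(names(by_two_dropped))
```

With drop = TRUE only the combinations that actually occur in the data are kept.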
apply()
x = array
MARGIN = 2 (column), 1 (row)
FUN = function
... = other arguments that need to be passed to other functions
examples
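As a minimal sketch of the arguments listed above, the following applies functions over the rows and columns of a small made-up matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums  <- apply(m, 1, sum)     # MARGIN = 1: one result per row
col_means <- apply(m, 2, mean)    # MARGIN = 2: one result per column

# Extra arguments after FUN are passed on to it through `...`.
col_quartiles <- apply(m, 2, quantile, probs = c(0.25, 0.75))

print(row_sums)
print(col_means)
print(col_quartiles)
```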
lapply()
loops over a list and evaluate a function on each element and always returns a list
Note: since input must be a list, it is possible that conversion may be needed
lapply(x, FUN, ...) = takes list/vector as input, applies a function to each element of the list,
returns a list of the same length
x = list (if not list, will be coerced into list through as.list, if not possible > error)
data frames are lists of columns and can be used here
FUN = function (without parentheses)
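A brief sketch of lapply() on a list and on a data frame; the inputs are invented.

```r
x <- list(a = 1:5, b = c(2, 4, 6))

means <- lapply(x, mean)                 # always a list, same length and names as x
trimmed <- lapply(x, mean, trim = 0.2)   # extra arguments go through `...`
first <- lapply(x, function(v) v[1])     # anonymous function

# A data frame is a list of columns, so lapply() visits each column.
df <- data.frame(n = 1:2, s = c("u", "v"), stringsAsFactors = FALSE)
col_classes <- lapply(df, class)

print(means)
print(first)
print(col_classes)
```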
tapply()
data = vector
INDEX = factor/list of factors
FUN = function
... = arguments to be passed to function
simplify = whether to simplify the result
example
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10); tapply(x, f, mean) = returns the mean of each group (f level) of x data
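The same idea with fixed numbers, so the group results are easy to verify by hand. The values and group labels are invented for this sketch.

```r
values <- c(1, 2, 3, 10, 20, 30, 100, 200, 300)
groups <- gl(3, 3, labels = c("low", "mid", "high"))

# One mean per factor level, simplified to a named vector.
group_means <- tapply(values, groups, mean)

# When FUN returns more than one value per group, the result is a list.
group_ranges <- tapply(values, groups, range)

print(group_means)
print(group_ranges)
```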
mapply()
multivariate apply, applies a function in parallel over a set of arguments
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)
FUN = function
... = arguments to apply over
MoreArgs = list of other arguments to FUN
SIMPLIFY = whether the result should be simplified
example
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
aggregate()
aggregate computes summary statistics of data subsets (similar to multiple tapply at the same time)
aggregate(list(name = dataToCompute), list(name = factorVar1,name = factorVar2),
function, na.rm = TRUE)
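A sketch of the call pattern above on the built-in mtcars dataset (an assumption made for illustration): mean miles per gallon for each combination of cylinder count and transmission type.

```r
data(mtcars)

agg <- aggregate(list(mean_mpg = mtcars$mpg),             # data to summarize
                 list(cyl = mtcars$cyl, am = mtcars$am),  # grouping variables
                 mean, na.rm = TRUE)                      # function and its extra argument

print(agg)
```

The result is a data frame with one row per group and one column per grouping variable plus the summary.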
Simulation
sample(values, n, replace = FALSE) = generate random samples
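r*** function (for "random") = random number generation (ex. rnorm)
d*** function (for "density") = evaluates the probability density/mass function (ex. dnorm)
p*** function (for "probability") = evaluates the cumulative distribution function (ex. pnorm)
q*** function (for "quantile") = evaluates the quantile function (ex. qnorm)
If $\Phi$ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = $\Phi(q)$ and qnorm(p) = $\Phi^{-1}(p)$.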
set.seed() = reproduce same data
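A minimal sketch of the four function families for the Normal distribution, together with set.seed(). The seed value 42 is arbitrary.

```r
# Resetting the seed reproduces the same random draws.
set.seed(42)
first <- rnorm(3)
set.seed(42)
second <- rnorm(3)

density_at_zero <- dnorm(0)           # height of the standard Normal density at 0
prob_below      <- pnorm(1.96)        # P(Z <= 1.96), roughly 0.975
quantile_back   <- qnorm(prob_below)  # inverse of pnorm: recovers 1.96

# sample() without replacement never repeats a value.
draw <- sample(1:10, 4, replace = FALSE)

print(identical(first, second))
print(density_at_zero)
print(prob_below)
print(quantile_back)
```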
Simulation Examples
rbinom(1, size = 100, prob = 0.7) = returns a binomial random variable that represents the number of successes in a given number of independent trials
1 = number of observations
size = 100 = number of independent trials behind each observation
prob = 0.7 = probability of success
rnorm(n, mean = m, sd = s) = generates n random samples from a normal distribution with mean m and standard deviation s (mean = 0, sd = 1 by default, i.e. the standard normal)
n = number of values
r = rate
ppois(n, r) = cumulative distribution
ppois(2, 2) = Pr(x <= 2)
replicate(n, rpois()) = repeat operation n times
Generate Numbers for a Linear Model
Linear model
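$y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon \sim N(0, 2^2)$, $x \sim N(0, 1^2)$, $\beta_0 = 0.5$, $\beta_1 = 2$
set.seed(20)
x <- rnorm(100)           # normal predictor
x <- rbinom(100, 1, 0.5)  # binomial predictor (alternative; overwrites the line above)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e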
Poisson model
$Y \sim \mathrm{Poisson}(\mu)$, $\log(\mu) = \beta_0 + \beta_1 x$, where $\beta_0 = 0.5$, $\beta_1 = 0.3$
x <- rnorm(100)
log.mu <- 0.5 + 0.3* x
y <- rpois(100, exp(log.mu))
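One way to sanity-check simulated data is to fit the matching model and compare the estimates with the true coefficients. The sketch below does this with lm() and glm(); the fitting step is an addition for illustration and not part of the simulation recipe itself.

```r
set.seed(20)

# Linear model with true intercept 0.5 and true slope 2.
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
lin_est <- coef(lm(y ~ x))

# Poisson model with log(mu) = 0.5 + 0.3 * x.
x2 <- rnorm(100)
y2 <- rpois(100, exp(0.5 + 0.3 * x2))
pois_est <- coef(glm(y2 ~ x2, family = poisson))

# Estimates should land near the true values, up to sampling noise.
print(lin_est)
print(pois_est)
```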
Base Graphics
data(set) = load data
plot(data) = R plots the data as best as it can
x = variable, x axis
y = variable
xlab, ylab = corresponding labels
main, sub = title, subtitle
col = 2 or col = "red" = color
pch = 2 = different symbols for points
xlim, ylim = c(v1, v2) = restrict range of plot
boxplot(x ~ y, data = d) = creates boxplot of x grouped by y using the data.frame provided
hist(x, breaks) = plots histogram of the data
breaks = 100 = split data into 100 bins
read.table(), read.csv() = most common, read text files (rows, col) return data frame
readLines() = read lines of text, returns character vector
source(file) = read R code
dget() = read R code files (R objects that have been deparsed into text files)
load(), unserialize() = read binary objects
writing data
write.table(), writeLines(), dump(), dput(), save(), serialize()
read.table() arguments:
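A hedged round-trip sketch of the reading and writing functions, using temporary files. The arguments shown for read.table() (header, sep, colClasses) are commonly used ones picked for illustration, not a complete list.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Text table: write, then read back with explicit column classes.
csv_path <- tempfile(fileext = ".csv")
write.table(df, csv_path, sep = ",", row.names = FALSE)
back <- read.table(csv_path, header = TRUE, sep = ",",
                   colClasses = c("integer", "character"))
raw_lines <- readLines(csv_path)      # header line plus one line per row

# Deparsed R object: dput() writes R code, dget() reads it back.
obj_path <- tempfile(fileext = ".R")
dput(df, file = obj_path)
restored <- dget(obj_path)

# Binary format: save() and load() (loaded into its own environment here).
bin_path <- tempfile(fileext = ".RData")
save(df, file = bin_path)
loaded <- new.env()
load(bin_path, envir = loaded)

print(back)
print(raw_lines)
```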
Control Structures
Common structures are if/else, for, while, repeat, break, next, and return
Note: Control structures are primarily useful for writing programs; for command-line interactive work,
the apply functions are more useful
if - else
# basic structure
if(<condition>) {
## do something
} else {
## do something else
}
# if tree
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
y <- if(x > 3) {10} else {0} = if-else is an expression that returns a value, so the result can be assigned directly
for
# basic structure
for(i in 1:10) {
# print(i)
}
# nested for loops
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))) {
# print(x[i, j])
}
}
seq_along(x) = generates the sequence 1 to length(x), used to loop over the elements of a vector
while
count <- 0
while(count < 10) {
# print(count)
count <- count + 1
}
conditions can be combined with logical operators
repeat and break
Repeat initiates an infinite loop
not commonly used in statistical applications but they do have their uses
The only way to exit a repeat loop is to call break
x0 <- 1
tol <- 1e-8
repeat {
x1 <- computeEstimate()
if(abs(x1 - x0) < tol) {
break
} else {
x0 <- x1 # requires algorithm to converge
}
}
Note: The above loop is a bit dangerous because there's no guarantee it will stop
Better to set a hard limit on the number of iterations (e.g. using a for loop) and then report
whether convergence was achieved or not.
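The safer pattern suggested above can be sketched with a for loop and a convergence flag. Here compute_estimate() is a hypothetical stand-in for the update step (one Newton iteration for the square root of 2), and max_iter is an arbitrary cap.

```r
# Hypothetical update step: one Newton iteration towards sqrt(2).
compute_estimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
max_iter <- 50          # hard limit on the number of iterations
converged <- FALSE

for (iter in seq_len(max_iter)) {
  x1 <- compute_estimate(x0)
  if (abs(x1 - x0) < tol) {
    converged <- TRUE
    break
  }
  x0 <- x1
}

# Report whether convergence was achieved.
if (converged) {
  message("converged after ", iter, " iterations: ", x1)
} else {
  warning("no convergence within ", max_iter, " iterations")
}
```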
next and return
next = (no parentheses) skips an element, to continue to the next iteration
return = signals that a function should exit and return a given value
for(i in 1:100) {
if(i <= 20) {
## Skip the first 20 iterations
next
}
## Do something here
}
Functions
name <- function(arg1, arg2, ...){ }
structure
f <- function(<arguments>) {
## Do something interesting
}
functions are first-class objects and can be treated like other objects (ex. passed into other functions)
functions can be nested, so that you can define a function inside of another function
functions have named arguments (ex. x = mydata), which can also be given default values
sd(x = mydata) (matching by name)
formal arguments = arguments included in the functional definition
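A minimal sketch of these points; the function names power and apply_twice are invented for the example.

```r
# A function with a default value for its second argument.
power <- function(x, exponent = 2) x^exponent

by_default  <- power(3)                    # exponent falls back to 2
by_position <- power(3, 3)                 # matched by position
by_name     <- power(exponent = 3, x = 2)  # matched by name, order is free

# Functions are objects: they can be passed to other functions.
apply_twice <- function(f, x) f(f(x))
twice <- apply_twice(power, 3)             # (3^2)^2

# formals() lists the formal arguments of a function.
arg_names <- names(formals(power))

print(c(by_default, by_position, by_name, twice))
print(arg_names)
```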
Scoping
scoping rules determine how a value is associated with a free variable in a function
free variables = variables used in a function that are neither arguments nor local variables defined inside it
R uses lexical/static scoping
common alternative = dynamic scoping
lexical scoping = values of free vars are searched in the environment in which the function is
defined
environment = collection of symbol/value pairs (x = 3.14)
each package has its own environment
only environment without parent environment is the empty environment
closure/function closure = function + associated environment
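search order for free variable
1. environment where the function is defined
2. parent environment
3. ... (repeat through successive parent environments)
4. top-level environment: global environment (workspace) or namespace of a package
5. empty environment > produces error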
when a function/variable is called, R searches through the following list (as shown by search()) and uses the first match
1. .GlobalEnv
2. package:stats
3. package:graphics
4. package:grDevices
5. package:utils
6. package:datasets
7. package:methods
8. Autoloads
9. package:base
order matters
.GlobalEnv = everything defined in the current workspace
any package that gets loaded with library() gets put in position 2 of the above search list
namespaces are separate for functions and non-functions
possible for object c and function c to coexist
Scoping Example
cube(3)    # defines x = 3
## [1] 27
square(3)
## [1] 9
# returns the free variables in the function
ls(environment(cube))
## [1] "n"   "pow"
get("n", environment(cube))
## [1] 3
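The definitions of cube and square do not appear above. A sketch consistent with the output shown, assuming a constructor function (here called make.power, a name chosen for this example) that returns a closure:

```r
# Constructor: returns a function that remembers the value of n.
make.power <- function(n) {
  pow <- function(x) {
    x^n            # n is a free variable, found in the defining environment
  }
  pow
}

cube   <- make.power(3)
square <- make.power(2)

print(cube(3))
print(square(3))

# The closure's environment holds the objects created inside make.power().
print(ls(environment(cube)))
print(get("n", environment(cube)))
```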
Lexical vs Dynamic Scoping Example
y <- 10
f <- function(x) {
y <- 2
y^2 + g(x)
}
g <- function(x) {
x*y
}
Lexical Scoping
1. f(3) > calls g(x)
2. y isn't defined locally in g(x) > searches in the environment where g was defined (working environment/global workspace)
3. finds y > y = 10
Dynamic Scoping
1. f(3) > calls g(x)
2. y isn't defined locally in g(x) > searches in the calling environment (f function)
3. finds y > y = 2
parent frame = refers to the calling environment in R, the environment from which the function was called
Note: when the defining environment and the calling environment are the same, lexical and dynamic scoping produce the same result
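The two lookups can be compared in code. R itself only uses lexical scoping, so the "dynamic" variant below is an emulation that reads y from parent.frame(); the names g_dynamic and f_dynamic are hypothetical.

```r
y <- 10

f <- function(x) {
  y <- 2
  y^2 + g(x)
}
g <- function(x) {
  x * y                 # lexical scoping: y comes from where g was defined
}
lexical_result <- f(3)  # 2^2 + 3 * 10

# Emulated dynamic scoping: look y up in the caller's environment instead.
g_dynamic <- function(x) {
  caller <- parent.frame()
  x * get("y", envir = caller)
}
f_dynamic <- function(x) {
  y <- 2
  y^2 + g_dynamic(x)
}
dynamic_result <- f_dynamic(3)  # 2^2 + 3 * 2

print(lexical_result)
print(dynamic_result)
```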
Consequences of Lexical Scoping
all objects must be carried in memory
all functions carry pointer to their defining environment (memory address)
Optimization
Optimization routines in R (optim, nlm, optimize) require you to pass a function whose argument is a
vector of parameters
Note: these functions minimize, so to maximize a likelihood, minimize the negative log-likelihood
Constructor functions = functions that build the objective function to be fed into the optimization routines
example
# write constructor function
make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
    params <- fixed
    function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5*length(data)*log(2*pi*sigma^2)
        b <- -0.5*sum((data-mu)^2) / (sigma^2)
        -(a + b)
    }
}
# initialize seed and print function
set.seed(1); normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals); nLL
## function(p) {
##     params[!fixed] <- p
##     mu <- params[1]
##     sigma <- params[2]
##     a <- -0.5*length(data)*log(2*pi*sigma^2)
##     b <- -0.5*sum((data-mu)^2) / (sigma^2)
##     -(a + b)
## }
## <environment: 0x7fbf33a2dd80>
# Estimating Parameters
optim(c(mu = 0, sigma = 1), nLL)$par
##       mu    sigma
## 1.218239 1.787343
# Fixing sigma = 2
nLL <- make.NegLogLik(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
## [1] 1.217775
# Fixing mu = 1
nLL <- make.NegLogLik(normals, c(1, FALSE))
optimize(nLL, c(1e-6, 10))$minimum
## [1] 1.800596
Debugging
message: generic notification/diagnostic message, execution continues
message() = generate message
warning: something's wrong but not fatal, execution continues
warning() = generate warning
error: fatal problem occurred, execution stops
stop() = generate error
condition: generic concept for indicating something unexpected can occur
invisible() = suppresses auto printing
Note: random number generator must be controlled to reproduce problems (set.seed to pinpoint
problem)
traceback: prints out function call stack after error occurs
must be called right after error
debug: flags function for debug mode, lets you step through the function one line at a time
debug(function) = enter debug mode
browser: suspends the execution of a function wherever it's placed
embedded in code and when the code is run, the browser comes up
trace: allows inserting debugging code into a function at specific places
recover: error handler, freezes at point of error
options(error = recover) = instead of console, brings up menu (simi)
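A small sketch of the three condition types in action. The function check_positive is invented for this example, and tryCatch() is used here only to capture the conditions so the script keeps running.

```r
check_positive <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")                 # error: execution stops
  }
  if (any(is.na(x))) {
    warning("x contains NA; returning NA")    # warning: execution continues
    return(invisible(NA))
  }
  message("checking ", length(x), " value(s)")  # diagnostic message
  invisible(all(x > 0))                       # result is not auto-printed
}

ok <- suppressMessages(check_positive(c(1, 2)))

# Capture the error and warning messages instead of halting.
caught <- tryCatch(check_positive("a"),
                   error = function(e) conditionMessage(e))
warned <- tryCatch(check_positive(c(1, NA)),
                   warning = function(w) conditionMessage(w))

print(ok)
print(caught)
print(warned)
```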
R Profiler
optimizing code cannot be done without performance analysis and profiling
# system.time example
system.time({
n <- 1000
r <- numeric(n)
for (i in 1:n) {
x <- rnorm(n)
r[i] <- mean(x)
}
})
##    user  system elapsed
##   0.147   0.004   0.152
system.time(expression)
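takes R expression, returns amount of time needed to execute it (useful when you already know where the slow code is)
computes time (in sec) - gives time until error if an error occurs
can wrap multiple lines of code with {}
returns object of class proc_time
user time = CPU time charged to the R process (the time the computer experiences)
elapsed time = wall clock time (the time the user experiences)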
Good to break code into functions so profilers can give useful information about where time is spent
C/FORTRAN code is not profiled
Note: R must be compiled with profiler support (generally the case)
Note: the profiler (Rprof()) should NOT be used together with system.time()
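A hedged sketch of a profiling session with Rprof() and summaryRprof() from the utils package. The workload (sorting random numbers for about half a second) and the 0.01 second sampling interval are arbitrary choices for illustration.

```r
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)   # start sampling the call stack
start <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - start < 0.5) {
  x <- rnorm(1e5)
  invisible(sort(x))                # work for the profiler to observe
}
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
print(head(prof$by.self))           # time spent in each function itself
print(prof$sampling.time)           # total time covered by the samples
```

The by.self table shows where time is spent excluding called functions, while by.total includes them.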
Miscellaneous
unlist(rss) = flattens a list object into a vector
ls("package:elasticnet") = lists the objects exported by an attached package