DA_Lab_Week-2
Agenda:
1. R If ... Else
2. R While Loop
3. R Functions
a. Creating a Function
b. Arguments
4. Data Structures
a. Vectors
b. Lists
c. Matrices
d. Arrays
e. Data Frames
5. R Graphics
a. Plot
b. Line
c. Scatterplot
d. Pie Charts
e. Bars
6. R Statistics
7. Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files
R If ... Else
Example:
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
Output:
[1] "b is greater than a"
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
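Neither example above handles the case where a is greater than b. A minimal sketch that adds a final else branch for it; the variable result and the swapped values of a and b are assumptions made for illustration:

```r
a <- 200
b <- 33

# Exactly one of the three branches runs
if (b > a) {
  result <- "b is greater than a"
} else if (a == b) {
  result <- "a and b are equal"
} else {
  result <- "a is greater than b"
}
print(result)
```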
R Loops
Loops are handy because they save time, reduce errors, and make code more
readable.
This lab covers two kinds of loop: while loops and for loops.
R While Loops
Example
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
Note:
Break
With the break statement, we can stop the loop even if the while condition is
TRUE.
Next
With the next statement, we can skip an iteration without terminating the loop.
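A minimal sketch of both statements in one while loop; the vector printed is a hypothetical helper that records which values were reached:

```r
i <- 0
printed <- c()   # hypothetical helper: collects the values that get printed

while (i < 6) {
  i <- i + 1
  if (i == 3) next    # skip the rest of this iteration, keep looping
  if (i == 5) break   # leave the loop although i < 6 is still TRUE
  print(i)
  printed <- c(printed, i)
}
# The loop prints 1, 2 and 4: 3 is skipped by next, and break stops at 5
```

Note that the counter is incremented before next is reached; otherwise the loop would never move past the skipped value.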
For Loops
A for loop is used for iterating over a sequence.
Example
for (x in 1:10) {
print(x)
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
R Functions
A function is a block of code which only runs when it is called.
You can pass data, known as parameters, into a function.
A function can return data as a result.
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() {
# create a function with the name my_function
print("Hello World!")
}
my_function() # call the function named my_function
Arguments
Information can be passed into functions as arguments.
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
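The examples above print or paste a value but do not show an explicit return. A small sketch, assuming a hypothetical function full_name with a default value for its second argument:

```r
# lname has a default value, so it can be omitted when calling the function
full_name <- function(fname, lname = "Griffin") {
  result <- paste(fname, lname)
  return(result)   # hand the value back to the caller
}

peter <- full_name("Peter")             # uses the default last name
homer <- full_name("Homer", "Simpson")  # overrides the default
peter
homer
```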
Data Structures
Vectors
To combine a list of items into a vector, use the c() function and separate the
items with commas.
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
Example
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values
Vector Length
To find out how many items a vector has, use the length() function.
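For instance, with the fruits vector from the earlier example:

```r
fruits <- c("banana", "apple", "orange")
n_fruits <- length(fruits)   # number of items in the vector
n_fruits                     # 3
```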
Sort a Vector
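To sort the items of a vector alphabetically or numerically, the sort() function can be used. A minimal sketch; the two vectors are made-up data for illustration:

```r
fruits <- c("banana", "mango", "apple")
numbers <- c(13, 3, 5, 7, 20, 2)

sorted_fruits <- sort(fruits)                    # alphabetical order
sorted_numbers <- sort(numbers)                  # ascending order
sorted_desc <- sort(numbers, decreasing = TRUE)  # descending order

sorted_fruits
sorted_numbers
sorted_desc
```

sort() returns a new vector and leaves the original unchanged.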
Access Vectors
You can access a vector item by referring to its index number inside brackets []. The
first item has index 1, the second item has index 2, and so on (for example, fruits[1]).
You can also access multiple elements by referring to different index positions with the
c() function.
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
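A short sketch of single and multiple access on this vector; the variable names first, picked and rest are chosen here for illustration:

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")

first <- fruits[1]          # a single item by index
picked <- fruits[c(1, 3)]   # several items: the first and the third
rest <- fruits[-1]          # a negative index drops that item

first    # "banana"
picked   # "banana" "orange"
rest
```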
Note:
2. The seq() function has three parameters: from is where the sequence starts, to is
where the sequence stops, and by is the interval of the sequence.
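A minimal sketch of seq() with all three parameters named; the values are arbitrary:

```r
# Start at 0, stop at 100, step by 20
numbers <- seq(from = 0, to = 100, by = 20)
numbers   # 0 20 40 60 80 100
```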
Lists
A list in R can contain many different data types inside it. A list is a collection of data
which is ordered and changeable.
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
Access Lists
You can access a list item by referring to its index number inside brackets. The first
item has index 1, the second item has index 2, and so on.
Check if Item Exists
To find out if a specified item is present in a list, use the %in% operator.
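A short sketch of both operations on the list defined earlier. Note the difference between single brackets, which return a list, and double brackets, which return the item itself:

```r
thislist <- list("apple", "banana", "cherry")

sub_list <- thislist[1]      # a list of length 1 containing "apple"
item <- thislist[[1]]        # the character value "apple" itself

has_apple <- "apple" %in% thislist   # TRUE
has_mango <- "mango" %in% thislist   # FALSE

item
has_apple
has_mango
```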
Matrices
Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix
Arrays
Compared to matrices, arrays can have more than two dimensions.
We can use the array() function to create an array, and the dim parameter to
specify the dimensions.
Example
# A one-dimensional vector with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
Output:
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
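In dim = c(4, 3, 2), the three numbers are the rows, the columns, and the number of matrices. A minimal sketch of reading items back out with the same three positions; the variable names are chosen here for illustration:

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

array_dims <- dim(multiarray)       # 4 3 2
one_item <- multiarray[2, 3, 2]     # row 2, column 3 of the second matrix
first_matrix <- multiarray[, , 1]   # the whole first matrix (4 rows, 3 columns)

array_dims
one_item   # 22
```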
Data Frames
A data frame can hold different types of data. While the first column can
be character, the second and third can be numeric or logical. However, each
column should hold a single type of data.
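Example
Use the summary() function to summarize the data from a Data Frame:
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data_Frame
summary(Data_Frame)
Output (from printing Data_Frame):
  Training Pulse Duration
1 Strength   100       60
2  Stamina   150       30
3    Other   120       45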
Access Items
Example:
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Output:
Training
1 Strength
2 Stamina
3 Other
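[1] Strength Stamina Other
Levels: Other Stamina Strength
[1] Strength Stamina Other
Levels: Other Stamina Strength
Note: The Levels lines appear when Training is stored as a factor, which was the
default before R 4.0.0. From R 4.0.0 onward, data.frame() keeps strings as
character vectors, so the last two commands print
[1] "Strength" "Stamina" "Other" instead.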
Plot
The plot() function is used to draw points (markers) in a diagram.
The function takes parameters for specifying points in the diagram.
Parameter 1 specifies points on the x-axis.
Parameter 2 specifies points on the y-axis.
Example
Draw one point in the diagram, at position 1 on the x-axis and position 3 on the y-axis:
plot(1, 3)
Multiple Points
Example
plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
Draw a Line
The plot() function also takes a type parameter with the value l to draw a line to connect
all the points in the diagram:
Plot Labels
The plot() function also accepts other parameters, such as main, xlab and ylab,
if you want to customize the graph with a main title and different labels for the
x-axis and y-axis:
plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
Graph Appearance
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while 0.5 means 50%
smaller, and 2 means 100% larger):
Example
plot(1:10, cex=2)
Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)
Lines
A line graph has a line that connects all the points in a diagram.
To create a line, use the plot() function and add the type parameter with a value
of "l":
Example
plot(1:10, type="l")
Line Color
The line color is black by default. To change the color, use the col parameter:
Example
plot(1:10, type="l", col="blue")
Line Width
To change the width of the line, use the lwd parameter (1 is default, while 0.5
means 50% smaller, and 2 means 100% larger).
Line Styles
The line is solid by default. Use the lty parameter with a value from 0 to 6 to
specify the line format.
For example, lty=3 displays a dotted line instead of a solid line, and lty=2
displays a dashed line.
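A minimal sketch combining line width and line style; the data 1:10 and the values held in line_width and line_style are illustrative assumptions:

```r
line_width <- 2   # twice the default width
line_style <- 3   # dotted

# A thick dotted line
plot(1:10, type = "l", lwd = line_width, lty = line_style)
# Add a second, dashed line (lty = 2) at the default width
lines(10:1, type = "l", lty = 2)
```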
Multiple Lines
To display more than one line in a graph, use the plot() function together with the
lines() function.
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue")
lines(line2, type="l", col = "red")
Scatter Plot
A "scatter plot" is a type of plot used to display the relationship between two
numerical variables, and plots one dot for each observation.
It needs two vectors of the same length, one for the x-axis (horizontal) and one for
the y-axis (vertical).
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)
Compare Plots
To compare the plot with another plot, use the points() function:
Example
Draw two plots on the same figure:
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars:
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)
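The data above still has to be drawn. A minimal sketch of the drawing step, assuming red and blue as colors and shared axis limits so that no point from either day is cut off; x_range, y_range and the axis labels are choices made here for illustration:

```r
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars:
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)

# Axis limits that cover both days
x_range <- range(c(x1, x2))
y_range <- range(c(y1, y2))

# plot() draws the first data set; points() adds the second to the same figure
plot(x1, y1, col = "red", xlim = x_range, ylim = y_range,
     xlab = "Car age", ylab = "Car speed")
points(x2, y2, col = "blue")
```

Without xlim and ylim, the axes are sized for day one only, and day-two points outside that range would not be visible.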
Pie Charts
A pie chart is a circular graphical view of data.
Use the pie() function to draw pie charts.
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)
Start Angle
You can change the start angle of the pie chart with the init.angle
parameter.
The value of init.angle is defined with angle in degrees, where default
angle is 0.
Example
Start the first pie at 90 degrees:
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)
Colors
You can add a color to each pie with the col parameter.
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with labels and colors
pie(x, label = mylabel, main = "Fruits", col = colors)
Legend
To add an explanatory legend for the pie slices, use the legend() function.
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, label = mylabel, main = "Pie Chart", col = colors)
# Display the explanation box
legend("bottomright", mylabel, fill = colors)
Bar Charts
A bar chart uses rectangular bars to visualize data. Bar charts can be displayed
horizontally or vertically. The height or length of each bar is proportional to the
value it represents.
Use the barplot() function to draw a vertical bar chart.
Example
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, col = "red")
Horizontal Bars
To display the bars horizontally instead of vertically, use horiz = TRUE:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)
Statistics Introduction
Percentiles
Probability distributions
Data Set
A data set is a collection of data, often presented in a table.
There is a popular built-in data set in R called "mtcars" (Motor Trend Car
Road Tests), which is retrieved from the 1974 Motor Trend US Magazine.
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data
set: ?mtcars
Get Information
Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
# Store the mtcars data set in a variable for better organization
Data_Cars <- mtcars
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Use the rownames() function to get the name of each row in the first
column, which is the name of each car: rownames(Data_Cars)
From the examples above, we have found out that the data set has 32 observations
(Mazda RX4, Mazda RX4 Wag, Datsun 710, etc) and 11 variables (mpg, cyl, disp, etc).
A variable is defined as something that can be measured or counted.
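Here is a brief explanation of the variables from the mtcars data set (as listed by ?mtcars):
mpg: Miles per (US) gallon
cyl: Number of cylinders
disp: Displacement (cubic inches)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1000 lbs)
qsec: 1/4 mile time
vs: Engine (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors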
The summary() function returns six statistical numbers for each numeric variable:
1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max
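A small sketch of summary() on a single variable, using horsepower (hp) from mtcars; hp_summary is a name chosen here for illustration:

```r
hp_summary <- summary(mtcars$hp)
hp_summary          # the six numbers listed above
names(hp_summary)   # the labels R uses for them
```

The labels R prints are Min., 1st Qu., Median, Mean, 3rd Qu. and Max.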
Max Min
Example
#Find the largest and smallest value of the variable hp (horsepower).
Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)
For example, we can use the which.max() and which.min() functions to find the
index position of the max and min value in the table:
Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)
Or even better, combine which.max() and which.min() with the rownames() function to
get the name of the car with the largest and smallest horsepower:
Example
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]
Mean
To calculate the average value (mean) of a variable, use the mean() function.
Example
Find the average weight (wt) of a car:
Data_Cars <- mtcars
mean(Data_Cars$wt)
Median
The median value is the value in the middle, after you have sorted all the values.
Note: If there are two numbers in the middle, you must divide the sum of those
numbers by two, to find the median.
Example
#Find the mid point value of weight (wt):
Data_Cars <- mtcars
median(Data_Cars$wt)
Mode
The mode is the value that appears most often.
Example: names(sort(-table(Data_Cars$wt)))[1]
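The one-liner above counts each value with table(), negates the counts so that sort() puts the most frequent value first, and reads off its name. A small sketch of the same idea as a reusable helper; get_mode is a hypothetical name and the sample vector is made up:

```r
# Return the most frequent value of x (as a character string, because
# it is read from the names of the frequency table)
get_mode <- function(x) {
  counts <- table(x)                 # frequency of each distinct value
  names(counts)[which.max(counts)]   # name of the largest count
}

sample_values <- c(2, 3, 3, 5, 3, 2)
sample_mode <- get_mode(sample_values)
sample_mode   # "3"
```

If several values are tied for the highest count, which.max() returns the first of them, so only one mode is reported.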
Percentiles
Percentiles are used in statistics to give you a number that describes the value
that a given percent of the values are lower than.
Example
Data_Cars <- mtcars
# c() specifies which percentile you want
quantile(Data_Cars$wt, c(0.75))
Note:
1. If you run the quantile() function without the second argument (probs), you will get
the percentiles 0, 25, 50, 75 and 100.
2. Quartiles
a. Quartiles are data divided into four parts, when sorted in an ascending
order:
b. The value of the first quartile cuts off the first 25% of the data
c. The value of the second quartile cuts off the first 50% of the data
d. The value of the third quartile cuts off the first 75% of the data
e. The value of the fourth quartile cuts off 100% of the data
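A minimal sketch tying the two notes together: calling quantile() with its defaults returns the minimum, the three inner quartile cut points and the maximum. The name wt_quartiles is chosen here for illustration:

```r
wt_quartiles <- quantile(mtcars$wt)   # defaults to 0%, 25%, 50%, 75%, 100%
wt_quartiles

# The second quartile (50%) is the median
same_as_median <- isTRUE(all.equal(wt_quartiles[["50%"]], median(mtcars$wt)))
same_as_median
```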
Look at Data
Note: The iris data set is used in this section to demonstrate data exploration with R.
Execute the following commands and note the output for each and write the
purpose of the command in comments using #:
> dim(iris)
> names(iris)
> str(iris)
> attributes(iris)
> iris[1:5, ]
> head(iris)
> tail(iris)
> ## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
> iris[idx, ]
> iris[1:10, "Sepal.Length"]
> iris[1:10, 1]
> iris$Sepal.Length[1:10]
Explore Individual Variables
Execute the following commands and note the output for each and write the
purpose of the command in comments using #:
> summary(iris)
> quantile(iris$Sepal.Length)
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
> var(iris$Sepal.Length)
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)
> pie(table(iris$Species))
Explore Multiple Variables
Execute the following commands and note the output for each and write the
purpose of the command in comments using #:
> barplot(table(iris$Species))
#calculate covariance and correlation between variables with cov() and cor().
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[,1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[,1:4])
> aggregate(Sepal.Length ~ Species, summary, data=iris)
> boxplot(Sepal.Length ~ Species, data=iris, xlab="Species", ylab="Sepal.Length")
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species,
+ pch=as.numeric(Species)))
> ## same function as above
> # plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species,
> # pch=as.numeric(iris$Species))
> smoothScatter(iris$Sepal.Length, iris$Sepal.Width)
> pairs(iris)
More Explorations
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> distMatrix <- as.matrix(dist(iris[,1:4]))
> heatmap(distMatrix)
> library(lattice)
> levelplot(Petal.Width~Sepal.Length*Sepal.Width, iris, cuts=9,
+ col.regions=grey.colors(10)[10:1])
> filled.contour(volcano, color=terrain.colors, asp=1,
+ plot.axes=contour(volcano, add=T))
> persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")
> library(MASS)
> parcoord(iris[1:4], col=iris$Species)
> library(lattice)
> parallelplot(~iris[1:4] | Species, data=iris)
> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~.)
Save Charts into Files
> # save as a PDF file
> pdf("myPlot.pdf")
> x <- 1:50
> plot(x, log(x))
> graphics.off()
> #
> # Save as a postscript file
> postscript("myPlot2.ps")
> x <- -20:20
> plot(x, x^2)
> graphics.off()
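The examples above close the file with graphics.off(), which shuts every open graphics device. A minimal sketch of a variant that closes only the device it opened, using dev.off(), and then checks that the file was written; out_file and the temporary directory are assumptions made for illustration:

```r
# Hypothetical output location inside the session's temporary directory
out_file <- file.path(tempdir(), "myPlot.pdf")

pdf(out_file)        # open a PDF device; plots now go to the file
x <- 1:50
plot(x, log(x))
dev.off()            # close only the current device and finish the file

file.exists(out_file)
```

dev.off() is the safer choice when other plot windows or devices are open and should stay open.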