DA_Lab_Week-2

The document outlines a Week-2 agenda focused on implementing data exploration and visualization using R, covering topics such as conditional statements, loops, functions, data structures, and graphics. It includes detailed explanations and examples of R programming concepts including vectors, lists, matrices, arrays, and data frames, as well as various plotting techniques. The document serves as a guide for exploring individual and multiple variables within datasets and saving visualizations.

Uploaded by

upesh
Week-2

Aim: Implement Data Exploration and Visualization on different Datasets to explore Multiple and Individual Variables.

Agenda:
1. R If ... Else
2. R While Loop
3. R Functions
a. Creating a Function
b. Arguments
4. Data Structures
a. Vectors
b. Lists
c. Matrices
d. Arrays
e. Data Frames
5. R Graphics
a. Plot
b. Line
c. Scatterplot
d. Pie Charts
e. Bars
6. R Statistics
7. Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files

R If ... Else

• Conditions and If Statements
• R supports the usual logical conditions from mathematics:

Operator   Name                        Example
==         Equal                       x == y
!=         Not equal                   x != y
>          Greater than                x > y
<          Less than                   x < y
>=         Greater than or equal to    x >= y
<=         Less than or equal to       x <= y

Example:
a <- 33
b <- 200
if (b > a) {
  print("b is greater than a")
}
Output: [1] "b is greater than a"

Note: R uses curly brackets { } to mark the block of code that belongs to the if statement.


Else If
The else if keyword tests another condition when the previous condition is FALSE.
Example
a <- 33
b <- 33

if (b > a) {
  print("b is greater than a")
} else if (a == b) {
  print("a and b are equal")
}

Output: [1] "a and b are equal"
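
The examples above stop at else if. A final else branch runs when none of the earlier conditions is TRUE. The following is a minimal sketch that reuses the names a and b; the values and the variable result are chosen here only for illustration.

```r
a <- 200
b <- 33

if (b > a) {
  result <- "b is greater than a"
} else if (a == b) {
  result <- "a and b are equal"
} else {
  # Runs only when both conditions above are FALSE
  result <- "a is greater than b"
}
print(result)
```

With these values only the else branch runs, because b is neither greater than nor equal to a.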

R Loops

Loops execute a block of code repeatedly as long as a specified condition is TRUE.

Loops are handy because they save time, reduce errors, and make code more
readable.

This lab uses two of R's loop commands:

• while loops
• for loops

R While Loops
Example

# Print i as long as i is less than 6:

i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}

Note:
• Break: with the break statement, we can stop the loop even if the while condition is
TRUE.
• Next: with the next statement, we can skip an iteration without terminating the loop.
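
The two statements in the note can be tried together in one loop. This is only a sketch: the limits 3 and 6 and the helper vector printed (which records what was printed) are assumptions made for illustration.

```r
i <- 0
printed <- c()   # hypothetical helper: collects the values that get printed
while (i < 10) {
  i <- i + 1     # increment first, so that next cannot cause an endless loop
  if (i == 3) {
    next         # skip the rest of this iteration when i is 3
  }
  if (i == 6) {
    break        # leave the loop entirely when i reaches 6
  }
  print(i)
  printed <- c(printed, i)
}
```

The loop prints 1, 2, 4 and 5: the value 3 is skipped by next, and break ends the loop before 6 is printed.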

For Loops
A for loop is used for iterating over a sequence.

Example
for (x in 1:10) {
  print(x)
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

R Functions
• A function is a block of code which only runs when it is called.
• You can pass data, known as parameters, into a function.
• A function can return data as a result.
Creating a Function
To create a function, use the function() keyword:
Example
# create a function with the name my_function
my_function <- function() {
  print("Hello World!")
}
my_function() # call the function named my_function

Arguments
• Information can be passed into functions as arguments.
Example
my_function <- function(fname) {
  paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")
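
A function can also return a value and give an argument a default. The sketch below assumes a hypothetical function full_name with a second parameter lname; neither name comes from the lab sheet.

```r
# full_name and its parameters are hypothetical names used for illustration
full_name <- function(fname, lname = "Griffin") {
  # paste() joins its arguments with a space; return() hands the result back
  return(paste(fname, lname))
}

name1 <- full_name("Peter")             # uses the default last name
name2 <- full_name("Homer", "Simpson")  # overrides the default
print(name1)
print(name2)
```

Storing the returned value in a variable, as done here, is what distinguishes returning data from only printing it.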

Data Structures
Vectors

• A vector is simply a list of items that are of the same type.

• To combine the list of items to a vector, use the c() function and separate the
items by a comma.

Example
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits

Example
# Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers
numbers

Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)

log_values

Vector Length

• To find out how many items a vector has, use the length() function.

Sort a Vector

• To sort items in a vector alphabetically or numerically, use the sort() function.
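
A short sketch of length() and sort() follows. The vectors are sample data chosen for illustration, and the result variable names are assumptions.

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)

n_fruits <- length(fruits)        # number of items in the vector
sorted_fruits <- sort(fruits)     # alphabetical order
sorted_numbers <- sort(numbers)   # ascending numerical order

n_fruits
sorted_fruits
sorted_numbers
```

sort() returns a new vector and leaves the original unchanged, so fruits and numbers keep their original order.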

Access Vectors

You can access the vector items by referring to their index number inside brackets []. The
first item has index 1, the second item has index 2, and so on. For example, fruits[1] returns the first item.

You can also access multiple elements by referring to different index positions with the
c() function.

Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first and third item (banana and orange)


fruits[c(1, 3)]

Note:

1. Repeat Vectors: To repeat vectors, use the rep() function

2. The seq() function has three parameters: from is where the sequence starts, to is
where the sequence stops, and by is the interval of the sequence.
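
The two functions in the note can be sketched as follows. The input values and variable names are chosen here for illustration; each and times are standard arguments of rep().

```r
repeat_each  <- rep(c(1, 2, 3), each = 2)    # repeats every item twice in place
repeat_times <- rep(c(1, 2, 3), times = 2)   # repeats the whole vector twice
steps <- seq(from = 0, to = 100, by = 20)    # 0, 20, 40, 60, 80, 100

repeat_each
repeat_times
steps
```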

Lists

A list in R can contain many different data types inside it. A list is a collection of data
which is ordered and changeable.

To create a list, use the list() function

Example

# List of strings
thislist <- list("apple", "banana", "cherry")

# Print the list


thislist

Access Lists

You can access the list items by referring to its index number, inside brackets. The first
item has index 1, the second item has index 2, and so on.
Check if Item Exists
To find out if a specified item is present in a list, use the %in% operator.
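
A minimal sketch of list access and the %in% check, using the same list as the example above. The result variable names are assumptions made for illustration.

```r
thislist <- list("apple", "banana", "cherry")

first_as_list <- thislist[1]     # single brackets return a list of length 1
first_item    <- thislist[[1]]   # double brackets return the item itself
has_apple     <- "apple" %in% thislist
has_mango     <- "mango" %in% thislist

first_item
has_apple
has_mango
```

The difference between [ ] and [[ ]] matters when the item is used in a calculation or a string function, which needs the item itself rather than a one-element list.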

Matrices

• A matrix is a two-dimensional data set with columns and rows.

• A column is a vertical representation of data, while a row is a horizontal
representation of data.

• A matrix can be created with the matrix() function. Specify
the nrow and ncol parameters to set the number of rows and columns.

Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix

Access More Than One Row
• More than one row can be accessed if you use the c() function.
Ex. thismatrix[c(1,2),]
Access More Than One Column
• More than one column can be accessed in the same way.
Ex. thismatrix[, c(1,2)]
Add Rows and Columns
• Use the cbind() function to add additional columns to a matrix.
• Use the rbind() function to add additional rows to a matrix.
Remove Rows and Columns
Use negative indices, built with the c() function, to remove rows and columns from a matrix.
# Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
Note: thismatrix has only two columns, so after this removal a single column is left and R returns it as a plain vector rather than a matrix.
Number of Rows and Columns
Use the dim() function to find the number of rows and columns in a matrix.
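
The following sketch applies cbind(), rbind() and dim() to the 3 x 2 matrix from the earlier example. The appended values and the names with_col and with_row are assumptions made for illustration.

```r
thismatrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

with_col <- cbind(thismatrix, c(7, 8, 9))   # append a third column
with_row <- rbind(thismatrix, c(10, 11))    # append a fourth row

dim(thismatrix)   # rows, then columns of the original
dim(with_col)
dim(with_row)
```

The new column needs one value per row and the new row needs one value per column; otherwise R recycles or truncates the values and issues a warning.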

Arrays
• Compared to matrices, arrays can have more than two dimensions.
• We can use the array() function to create an array, and the dim parameter to
specify the dimensions.

Example
# A vector with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
# An array with more than one dimension (4 rows, 3 columns, 2 layers)
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
Output:
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
Data Frames

• A data frame is data displayed in a table format.

• A data frame can hold different types of data. While the first column can
be character, the second and third can be numeric or logical. However, all values
within a single column must have the same type.

• Use the data.frame() function to create a data frame:

Example

# Create a data frame
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Print the data frame
Data_Frame
Summarize the Data

Use the summary() function to summarize the data from a Data Frame:

Example

Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

Data_Frame

summary(Data_Frame)

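Output:

  Training Pulse Duration
1 Strength   100       60
2  Stamina   150       30
3    Other   120       45

     Training     Pulse          Duration
 Other   :1   Min.   :100.0   Min.   :30.0
 Stamina :1   1st Qu.:110.0   1st Qu.:37.5
 Strength:1   Median :120.0   Median :45.0
              Mean   :123.3   Mean   :45.0
              3rd Qu.:135.0   3rd Qu.:52.5
              Max.   :150.0   Max.   :60.0

Note: The counts in the Training column appear when that column is stored as a factor, which was the data.frame() default before R 4.0. In R 4.0 and later the column stays a character vector, and summary() reports its Length, Class and Mode instead.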

Access Items

• We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a
data frame.

Example:
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
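Output:
  Training
1 Strength
2  Stamina
3    Other
[1] Strength Stamina  Other
Levels: Other Stamina Strength
[1] Strength Stamina  Other
Levels: Other Stamina Strength
Note: The Levels lines are printed only when Training is a factor (the default before R 4.0). With a character column, R prints the three quoted strings instead.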
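
The three access forms do not return the same kind of object. The sketch below, which reuses the Data_Frame example and picks the Pulse column for illustration, shows the difference; the result variable names are assumptions.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

col_as_df   <- Data_Frame["Pulse"]     # still a data frame with one column
col_as_vec  <- Data_Frame[["Pulse"]]   # a plain numeric vector
same_as_vec <- Data_Frame$Pulse        # equivalent to the double-bracket form

is.data.frame(col_as_df)
is.numeric(col_as_vec)
mean(Data_Frame$Pulse)
```

Functions such as mean() expect a vector, so the [[ ]] or $ form is the one to pass to them.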

Plot
• The plot() function is used to draw points (markers) in a diagram.
• The function takes parameters for specifying points in the diagram.
• Parameter 1 specifies points on the x-axis.
• Parameter 2 specifies points on the y-axis.
Example
Draw one point in the diagram, at position (1, 3):
> plot(1, 3)

To draw more points, use vectors:
Example
Draw two points in the diagram, one at position (1, 3) and one at position (8, 10):
> plot(c(1, 8), c(3, 10))

Multiple Points
Example
>plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))

Draw a Line

The plot() function also takes a type parameter with the value "l" to draw a line to connect
all the points in the diagram:

plot(1:10, type="l") # "l" is the lowercase letter L, not the digit one

Plot Labels
• The plot() function also accepts other parameters, such as main, xlab and ylab, if you want to customize the graph with a main title and
different labels for the x and y-axis:
> plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
Graph Appearance
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while 0.5 means 50%
smaller, and 2 means 100% larger):
Example
plot(1:10, cex=2)
Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)

Lines
 A line graph has a line that connects all the points in a diagram.
 To create a line, use the plot() function and add the type parameter with a value
of "l":
Example
plot(1:10, type="l")

Line Color
 The line color is black by default. To change the color, use the col parameter
Example
plot(1:10, type="l", col="blue")
Line Width
• To change the width of the line, use the lwd parameter (1 is default, while 0.5
means 50% thinner, and 2 means 100% thicker).
Line Styles
• The line is solid by default. Use the lty parameter with a value from 0 to 6 to
specify the line format.
For example, lty=3 will display a dotted line instead of a solid line.

Available parameter values for lty:
• 0 removes the line
• 1 displays a solid line
• 2 displays a dashed line
• 3 displays a dotted line
• 4 displays a "dot dashed" line
• 5 displays a "long dashed" line
• 6 displays a "two dashed" line
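
The line styles listed above can be compared side by side. This is only a sketch: the axis limits, the title and the variable name styles are assumptions chosen for illustration.

```r
# Draw one horizontal line per line style (lty = 1 to 6), each with lwd = 2
styles <- 1:6
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0, 7),
     xlab = "", ylab = "lty value", main = "Line styles")
for (s in styles) {
  lines(c(0, 1), c(s, s), lty = s, lwd = 2)
}
```

type = "n" sets up the empty plotting area without drawing any points, so that only the lines added by lines() are visible.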

Multiple Lines
 To display more than one line in a graph, use the plot() function together with the
lines() function.
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue")
lines(line2, type="l", col = "red")

Scatter Plot
 A "scatter plot" is a type of plot used to display the relationship between two
numerical variables, and plots one dot for each observation.
 It needs two vectors of same length, one for the x-axis (horizontal) and one for
the y-axis (vertical).
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)

Compare Plots
To compare the plot with another plot, use the points() function:
Example
Draw two plots on the same figure:
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars:
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)

plot(x1, y1, main="Observation of Cars", xlab="Car age", ylab="Car speed",
     col="red", cex=2)
points(x2, y2, col="blue", cex=2)

Pie Charts
 A pie chart is a circular graphical view of data.
 Use the pie() function to draw pie charts.
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)
Start Angle
 You can change the start angle of the pie chart with the init.angle
parameter.
 The value of init.angle is defined with angle in degrees, where default
angle is 0.
Example
Start the first pie at 90 degrees:
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)

Labels and Header
• Use the labels parameter to add labels to the pie chart, and use the main
parameter to add a header. (The shortened form label also works, because R matches partial argument names.)
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Display the pie chart with labels
pie(x, labels = mylabel, main = "Fruits")

Colors
 You can add a color to each pie with the col parameter.
Example
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, labels = mylabel, main = "Fruits", col = colors)

Legend
• To add an explanation of what each slice represents, use the legend() function.
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors (x is the vector of pies created above)
pie(x, labels = mylabel, main = "Pie Chart", col = colors)
# Display the explanation box
legend("bottomright", mylabel, fill = colors)

Bar Charts
• A bar chart uses rectangular bars to visualize data. Bar charts can be displayed
horizontally or vertically. The height or length of each bar is proportional to the
value it represents.
• Use the barplot() function to draw a vertical bar chart.
Example
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, col = "red")

Density / Bar Texture


 To change the bar texture, use the density parameter.
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, density = 10)
Bar Width
 Use the width parameter to change the width of the bars
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, width = c(1,2,3,4))
Horizontal Bars
 If you want the bars to be displayed horizontally instead of vertically, use
horiz=TRUE.
Example

x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)

Statistics Introduction

• Statistics is the science of analyzing, reviewing, and drawing conclusions from data.

Some basic statistical numbers include:

• Mean, median and mode
• Minimum and maximum value
• Percentiles
• Variance and Standard Deviation
• Covariance and Correlation
• Probability distributions

• The R language was developed by two statisticians. It has many built-in
functionalities, in addition to libraries for the exact purpose of statistical
analysis. (For more information visit: https://www.w3schools.com/statistics/index.php)

Data Set
 A data set is a collection of data, often presented in a table.
 There is a popular built-in data set in R called "mtcars" (Motor Trend Car
Road Tests), which is retrieved from the 1974 Motor Trend US Magazine.
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data
set: ?mtcars
Get Information
 Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for better organization
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)

• Use the rownames() function to get the name of each row, which here is
the name of each car: rownames(Data_Cars)

From the examples above, we have found out that the data set has 32 observations
(Mazda RX4, Mazda RX4 Wag, Datsun 710, etc) and 11 variables (mpg, cyl, disp, etc).
• A variable is defined as something that can be measured or counted.
• Here is a brief explanation of the variables from the mtcars data set:

Variable Name   Description
mpg             Miles/(US) Gallon
cyl             Number of cylinders
disp            Displacement
hp              Gross horsepower
drat            Rear axle ratio
wt              Weight (1000 lbs)
qsec            1/4 mile time
vs              Engine (0 = V-shaped, 1 = straight)
am              Transmission (0 = automatic, 1 = manual)
gear            Number of forward gears
carb            Number of carburetors

Print Variable Values
• If you want to print all values that belong to a variable, access the data frame by
using the $ sign, and the name of the variable (for example cyl (cylinders)):
Example
Data_Cars <- mtcars
Data_Cars$cyl
Sort Variable Values
To sort the values, use the sort() function: sort(Data_Cars$cyl)
Analyzing the Data
• Use the summary() function to get a statistical summary of the data:
summary(Data_Cars)

The summary() function returns six statistical numbers for each variable:

1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max

Max and Min

Example
#Find the largest and smallest value of the variable hp (horsepower).
Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)

We can also use the which.max() and which.min() functions to find the
index position of the max and min value in the table:

Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)

Or even better, combine which.max() and which.min() with the rownames() function to
get the name of the car with the largest and smallest horsepower:
Example
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]

Mean, Median, and Mode


In statistics, there are often three values that interest us:

1. Mean - The average value
2. Median - The middle value
3. Mode - The most common value
Mean
 To calculate the average value (mean) of a variable from the mtcars data set, find
the sum of all values, and divide the sum by the number of values.

Sorted observation of wt (weight)


1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Example
Find the average weight (wt) of a car:
Data_Cars <- mtcars
mean(Data_Cars$wt)
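
The definition of the mean given above (the sum of all values divided by the number of values) can be checked against mean(). This is a minimal sketch; the names manual_mean and builtin_mean are assumptions made for illustration.

```r
Data_Cars <- mtcars

manual_mean  <- sum(Data_Cars$wt) / length(Data_Cars$wt)   # sum divided by count
builtin_mean <- mean(Data_Cars$wt)

manual_mean
all.equal(manual_mean, builtin_mean)   # compares with a small numeric tolerance
```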

Median
 The median value is the value in the middle, after you have sorted all the values.
Note: If there are two numbers in the middle, you must divide the sum of those
numbers by two, to find the median.
Example
#Find the mid point value of weight (wt):
Data_Cars <- mtcars
median(Data_Cars$wt)
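
The wt variable has 32 values, an even count, so the note about two middle numbers applies. A sketch of the calculation by hand, with variable names chosen for illustration:

```r
Data_Cars <- mtcars

sorted_wt <- sort(Data_Cars$wt)
n <- length(sorted_wt)                       # 32 observations
# With an even count, average the two values in the middle
manual_median <- (sorted_wt[n / 2] + sorted_wt[n / 2 + 1]) / 2

manual_median
all.equal(manual_median, median(Data_Cars$wt))
```

This shortcut only covers an even number of values; for an odd count the median is the single middle value.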

Mode
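• The mode value is the value that appears the greatest number of times.
• R has no built-in function for the statistical mode (the mode() function reports how an object is stored), so the mode is read from a frequency table built with table().
Example: names(sort(-table(Data_Cars$wt)))[1]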
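
The one-line example returns a single value even when several values are tied. A small helper can return all tied values instead. This is a sketch, and stat_mode is a hypothetical name, not a function provided by R.

```r
# stat_mode is a hypothetical helper name; R has no built-in statistical mode
stat_mode <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # every value tied for the top count
}

wt_mode  <- stat_mode(mtcars$wt)
cyl_mode <- stat_mode(mtcars$cyl)
wt_mode
cyl_mode
```

The result is a character vector because it is taken from the names of the table; wrap it in as.numeric() if a number is needed.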
Percentiles
• Percentiles are used in statistics to give you a number that describes the value
that a given percent of the values are lower than.
Example
Data_Cars <- mtcars
# The second argument (probs) specifies which percentile you want, as a fraction
quantile(Data_Cars$wt, c(0.75))
Note:
1. If you run the quantile() function without the second argument, you will get
the percentiles of 0, 25, 50, 75 and 100.
2. Quartiles divide the data into four parts when it is sorted in ascending order:
a. The value of the first quartile cuts off the first 25% of the data
b. The value of the second quartile cuts off the first 50% of the data
c. The value of the third quartile cuts off the first 75% of the data
d. The value of the fourth quartile cuts off 100% of the data
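
The relation between quartiles and the other statistics can be checked directly. In this sketch the variable names and the choice of the 10th and 90th percentiles are assumptions made for illustration.

```r
Data_Cars <- mtcars

quartiles <- quantile(Data_Cars$wt)                     # 0%, 25%, 50%, 75%, 100%
tails     <- quantile(Data_Cars$wt, probs = c(0.1, 0.9))

quartiles
tails
# The 50% quartile is the median; 0% and 100% are the minimum and maximum
quartiles[["50%"]] == median(Data_Cars$wt)
```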

7. Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files

Look at Data
• Note: The iris data set is used in this section to demonstrate data exploration with R.
• Execute the following commands, note the output of each, and write the
purpose of the command in a comment using #:
> dim(iris)
> names(iris)
> str(iris)
> attributes(iris)
> iris[1:5, ]
> head(iris)
> tail(iris)
> ## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
> iris[idx, ]
> iris[1:10, "Sepal.Length"]
> iris[1:10, 1]
> iris$Sepal.Length[1:10]
Explore Individual Variables
 Execute the following commands and note the output for each and write the
purpose of the command in comments using #:
> summary(iris)
> quantile(iris$Sepal.Length)
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
> var(iris$Sepal.Length)
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)
> pie(table(iris$Species))
Explore Multiple Variables
• Execute the following commands, note the output of each, and write the
purpose of the command in a comment using #:

> barplot(table(iris$Species))
#calculate covariance and correlation between variables with cov() and cor().
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[,1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[,1:4])

> aggregate(Sepal.Length ~ Species, summary, data=iris)
> boxplot(Sepal.Length~Species, data=iris, xlab="Species", ylab="Sepal.Length")
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species,
+ pch=as.numeric(Species)))
> ## same function as above
> # plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species,
> # pch=as.numeric(iris$Species))
> smoothScatter(iris$Sepal.Length, iris$Sepal.Width)
> pairs(iris)
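
The cov() and cor() commands in the list above are closely related: the correlation is the covariance divided by the product of the two standard deviations. A minimal sketch that checks this on two iris columns (the names x, y and manual_cor are assumptions made for illustration):

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Pearson correlation = covariance / (sd of x * sd of y)
manual_cor <- cov(x, y) / (sd(x) * sd(y))

manual_cor
all.equal(manual_cor, cor(x, y))
```

Unlike the covariance, the correlation always lies between -1 and 1, which makes it easier to compare across pairs of variables.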

More Explorations
> ## scatterplot3d, rgl and ggplot2 are add-on packages; install them once with install.packages() before calling library()
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> distMatrix <- as.matrix(dist(iris[,1:4]))
> heatmap(distMatrix)
> library(lattice)
> levelplot(Petal.Width~Sepal.Length*Sepal.Width, iris, cuts=9,
+ col.regions=grey.colors(10)[10:1])
> filled.contour(volcano, color=terrain.colors, asp=1,
+ plot.axes=contour(volcano, add=TRUE))
> persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")
> library(MASS)
> parcoord(iris[1:4], col=iris$Species)
> ## parallelplot() comes from lattice, which is already loaded above
> parallelplot(~iris[1:4] | Species, data=iris)
> library(ggplot2)
> ## qplot() is deprecated in recent ggplot2 releases (since 3.4.0) and prints a warning there
> qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~.)

Save Charts into Files
> # save as a PDF file
> pdf("myPlot.pdf")
> x <- 1:50
> plot(x, log(x))
> graphics.off()
> #
> # Save as a postscript file
> postscript("myPlot2.ps")
> x <- -20:20
> plot(x, x^2)
> graphics.off()
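
The examples above close every open graphics device with graphics.off(). A variation, shown here only as a sketch, closes just the device that was opened by using dev.off() and then checks that the file was written. The file name, the temporary folder and the page size are assumptions made for illustration.

```r
# Write the chart to a temporary folder; the file name is hypothetical
out_file <- file.path(tempdir(), "myPlot3.pdf")

pdf(out_file, width = 6, height = 4)   # page size in inches
x <- 1:50
plot(x, sqrt(x), type = "l")
dev.off()                              # close only the device opened above

file.exists(out_file)
```

Nothing is written to the file until the device is closed, so forgetting dev.off() or graphics.off() is a common reason for an empty or unreadable chart file.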
