Introduction To R Programming Notes For Students
R data types:
Numeric
Character
Built in
R data structures
Vector
List
Matrices
Arrays
Factors
1. How to create factors?
2. How to access components of a factor?
Data frames
1. How to create dataframe in R?
2. How to access components of a data frame?
Using row bind rbind() and column bind cbind() / Installing packages
R flow controls (loops and if then else) / import data sets/ How to read csv files?
Pie chart
Bar chart
Box plot
Histogram
Histogram with added parameters
Line graphs
Scatter plots
Strip charts
R statistical functions
Basic statistical functions: mean, median, mode, average, min, max
Correlation, linear regression, and multiple linear regression functions
ANOVA functions
Mode of teaching: Lab sessions
Introduction to R programming:
• R is one of the most popular languages used by statisticians, data analysts, researchers
and marketers to retrieve, clean, analyze, visualize and present data.
• Due to its expressive syntax and easy-to-use interface, it has grown in popularity in
recent years.
History of R
• Data Science
• Programming languages like R give a data scientist superpowers that allow them to
collect data in real-time, perform statistical and predictive analysis, create
visualizations and communicate actionable results to stakeholders.
• Statistical computing
• It has a rich package repository (CRAN) with more than 9,100 packages at the time these
notes were written, covering a very wide range of statistical functions
• Machine Learning
Alternatives of R programming:
SAS
SPSS
Python
Features of R
• Select a mirror
• Run the file and follow the steps in the instructions to install R.
R studio GUI
a. Features of RStudio
Code highlighting that gives different colors to keywords and variables, making it easier
to read
Automatic bracket matching
Code completion, so as to reduce the effort of typing the commands in full
Easy access to R Help, with additional features for exploring functions and parameters of
functions
Easy exploration of variables and values
Because RStudio is available free of charge for Linux, Windows, and Mac devices, it is a
good option to use with R. To open RStudio, click the RStudio icon in the menu system or on
the desktop.
b. Components of RStudio
Source – Top left corner of the screen contains a text editor that lets the user work with
source script files. Multiple lines of code can also be entered here. Users can save R
script file to disk and perform other tasks on the script.
Console – Bottom left corner is the R console window. The console in RStudio is
identical to the console in RGui. All the interactive work of R programming is performed
in this window.
Workspace and History – The top right corner is the R workspace and history window.
This provides an overview of the workspace, where the variables created in the session
along with their values can be inspected. This is also the area where the user can see a
history of the commands issued in R.
Files, Plots, Packages, and Help – The bottom right corner gives access to the following tools:
Files – This is where the user can browse folders and files on a computer.
Plots – This is where R displays the user’s plots.
Packages – This is where the user can view a list of all the installed packages.
Help – This is where the user can browse the built-in Help system of R.
R reserved words
Comparison of R with other technologies:
Data handling Capabilities – Good data handling capabilities and options for parallel
computation.
Availability / Cost – R is open source and free to use anywhere.
Advancement in Tool – If you work with the latest techniques, R tends to get new features
quickly.
Ease of Learning – R has a learning curve. Work is done by writing code rather than through
menus, so even simple procedures can take several lines of code.
Job Scenario – It is a good option for start-ups and companies looking for cost
efficiency.
Graphical capabilities – R has advanced graphical capabilities.
Customer Service support and community – R has a large and growing online
community.
Vectors:
Vectors are generally created with the c() function. A vector must have elements of the same
type, so c() will try to coerce the elements to a common type if they differ.
Coercion goes from lower to higher types: from logical to integer to double to character.
Example 1:
Code:
x <- c(1, 5, 4, 9, 0)
typeof(x)
length(x)
Example 2:
Code:
x <- c(1, 5.4, TRUE, "hello") # mixed types are coerced to character
x
typeof(x)
If we want to create a vector of consecutive numbers, the : operator is very helpful.
Code:
x <- 1:7; x
y <- 2:-2; y
Vector indexing in R starts from 1, unlike most programming languages, where indexing
starts from 0.
We can use a vector of integers as an index to access specific elements.
We can also use negative integers to return all elements except those specified.
However, we cannot mix positive and negative integers while indexing, and real numbers, if
used, are truncated to integers.
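Code:
x <- c(0, 2, 4, 6, 8, 10)
x
[1] 0 2 4 6 8 10
x[3] # access the 3rd element
[1] 4
x[c(2, 4)] # access the 2nd and 4th elements
[1] 2 6
x[-1] # access all but the 1st element
[1] 2 4 6 8 10
x[c(2, -4)] # positive and negative integers cannot be mixed
Error in x[c(2, -4)] : only 0's may be mixed with negative subscripts
x[c(2.4, 3.54)] # real numbers are truncated to integers
[1] 2 4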
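When we use a logical vector for indexing, the elements at the positions where the logical
vector is TRUE are returned.
This useful feature helps us filter a vector, as shown below.
x <- c(-3, 0, -1, 3) # assumed values chosen to match the outputs shown; the original definition was lost
x[c(TRUE, FALSE, FALSE, TRUE)]
[1] -3 3
x[x < 0] # keep only the negative elements
[1] -3 -1
x[x > 0]
[1] 3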
Indexing by name is useful when dealing with named vectors. We can name each
element of a vector.
names(x)
x["second"]
second
x[c("first", "third")]
first third
3 9
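The named-vector indexing above can be sketched as a small self-contained example. The vector names scores and temps and all of the numbers are assumptions made for illustration; only the element names first, second and third come from the notes.

```r
# Illustrative values: the element names follow the notes, the numbers are assumed
scores <- c(first = 3, second = 7, third = 9)

names(scores)                  # the names attached to the elements
scores["second"]               # one element, selected by name
scores[c("first", "third")]    # several elements, in the order requested

# Names can also be assigned after the vector has been created
temps <- c(21.5, 19.0, 23.2)
names(temps) <- c("mon", "tue", "wed")
temps[["tue"]]                 # [[ ]] returns the bare value without its name
```

Selecting by name keeps working even if the order of the elements changes, which is the main reason to prefer it over positions in longer scripts.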
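x <- -3:2 # starting vector, reconstructed from the output below
x
[1] -3 -2 -1 0 1 2
x[2] <- 0; x # modify the 2nd element
[1] -3 0 -1 0 1 2
x[x < 0] <- 5; x # modify all elements less than 0
[1] 5 0 5 0 1 2
x <- x[1:4]; x # keep only the first 4 elements
[1] 5 0 5 0
x <- -3:2; x
[1] -3 -2 -1 0 1 2
x <- NULL # assigning NULL deletes the contents of the vector
x
NULL
x[4]
NULL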
Matrix:
*Character constants
'example'
typeof("5")
*Numeric Constants
Types of operators
Arithmetic operators
+ / - / * / / / %% / %/% / ^
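u <- c(2, 3, 4)
v <- c(9, 8, 7)
print(u + v) # 11 11 11
b <- c(1, 2, 3)
c <- c(9, 8, 7) # note: this name shadows the c() function; calls to c() still work
print(b - c) # -8 -6 -4
print(u - v) # -7 -5 -3
print(v - u) # 7 5 3
g <- c(1, 2)
h <- c(2, 3, 4)
print(g * h) # 2 6 4, with a warning: the shorter vector is recycled
g <- c(1, 2, 3)
h <- c(3, 5, 6)
print(g * h) # 3 10 18
g <- c(1, 2, 3)
h <- c(3, 5, 6)
print(g %% h) # remainder: 1 2 3
print(g %/% h) # integer division: 0 0 0
print(g ^ h) # exponent: 1 32 729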
Built in constants
LETTERS
letters
pi
month.name
month.abb
Vector:
length(c(1, 5, 6, -2))
quantile(c(5,6,7))
sd(c(5,6,7,8))
max(c(5,6,7,8))
min(c(5,6,7,8))
sqrt(c(2, 4))
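Mode function:
R has no built-in function for the statistical mode (mode() returns the storage mode of an
object), so we write our own. The function name getmode is chosen here for illustration.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))] # most frequent value; ties go to the first one seen
}
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
result <- getmode(v)
print(result)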
A vector is a basic data structure in R. It contains elements of the same type. The
data types can be logical, integer, double, character, complex or raw.
A vector’s type can be checked with the typeof() function.
Another important property of a vector is its length. This is the number of
elements in the vector and can be checked with the function length().
print(apple)
print(class(apple))
Bschools
print(class(Bschools))
List:
print(list1)
list
Matrix:
Create a matrix
Matrix can be created using the matrix() function.
Dimension of the matrix can be defined by passing appropriate value for
arguments nrow and ncol.
print(M)
matrix(1:9, nrow = 3)
colnames(x)
rownames(x)
rbind(c(1,2,3),c(4,5,6))
cbind(c('t','e','a','c','h'),c(1,2,3,4,5))
rbind(c('t','e','a','c','h'),c(1,2,3,4,5))
x[,] # leaving row as well as column field blank will select entire matrix
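A minimal sketch of the matrix operations listed above, assuming small illustrative matrices. The object names m_col, m_row and m and the row and column labels are not from the notes.

```r
# Fill a 3 x 3 matrix column by column (the default) and row by row
m_col <- matrix(1:9, nrow = 3, ncol = 3)
m_row <- matrix(1:9, nrow = 3, byrow = TRUE)

# Row and column names can be supplied through dimnames (labels are illustrative)
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

dim(m)              # number of rows and columns
rownames(m)
colnames(m)

m[2, 3]             # single element: row 2, column 3
m["r1", ]           # whole first row, selected by name
m[, c("a", "c")]    # columns a and c, all rows

# rbind() stacks vectors as rows, cbind() as columns
m3 <- rbind(m, r3 = c(7, 8, 9))
m3
```

Comparing m_col and m_row side by side shows how byrow changes the fill order while the dimensions stay the same.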
A factor is a data structure used for fields that take only a predefined, finite number of
values (categorical data).
For example, a data field such as marital status may contain only values from single,
married, separated, divorced, or widowed. In such a case, we know the possible values
beforehand, and these predefined, distinct values are called levels. Following is an
example of a factor in R.
seeds_rice
print(factor_seeds)
print(nlevels(factor_seeds))
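The marital status case described above could be written as follows. This is a sketch only: the six observations and the names status and status_f are made up for illustration.

```r
# Marital status recorded for six hypothetical people
status <- c("single", "married", "married", "divorced", "single", "widowed")

# Declaring the levels up front also keeps levels that do not occur in the data
status_f <- factor(status,
                   levels = c("single", "married", "separated",
                              "divorced", "widowed"))

levels(status_f)       # the five predefined levels
nlevels(status_f)      # number of levels
table(status_f)        # counts per level; "separated" appears with count 0

status_f[2]            # a component is accessed just like a vector element
as.integer(status_f)   # the underlying integer codes
```

Listing the levels explicitly is a design choice: without the levels argument, R would create only the levels that appear in the data, in alphabetical order.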
Data frames
Age = c(42,38,26)
print(BMI)
Min = c (23,12,13,5),
Max = c(23,45,45,65)
)
print(Temp)
str(x) # structure of x
x["Name"]
x$Name
x[["Name"]]
x[[3]]
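To see how these access forms differ, here is a small self-contained sketch. The data frame x and its contents are hypothetical; only the column name Name appears in the notes.

```r
# A small hypothetical data frame
x <- data.frame(Name = c("Asha", "Ben", "Chen"),
                Age  = c(21, 25, 23),
                City = c("Pune", "Leeds", "Wuhan"),
                stringsAsFactors = FALSE)

str(x)                   # compact summary of the columns and their types

x["Name"]                # single brackets return a one-column data frame
x$Name                   # $ returns the column itself (a character vector)
x[["Name"]]              # double brackets return the same vector as $
x[[3]]                   # columns can also be picked by position

x[2, ]                   # second row, all columns
x[x$Age > 22, "Name"]    # names of the people older than 22
```

The practical difference is the return type: single brackets keep the data frame structure, while $ and double brackets hand back the underlying vector.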
library(gtools)
smartbind(df1,df2)
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Ricky","Danish","Mini","Ryan","Gary"),
salary = c(643.3,515.2,671.0,729.0,943.25),
stringsAsFactors = FALSE
)
print(emp.data)
The structure of the data frame can be seen by using the str() function.
str(emp.data)
print(result)
print(result)
print(result)
v <- emp.data
print(v)
# emp.newdata is an assumed name; the opening line of this call was lost
emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)
print(emp.finaldata)
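Row binding and column binding of data frames can be sketched with two tiny tables. The names df1, df2 and all_emp and the values are assumptions for illustration, not the employee data used in these notes.

```r
# Two hypothetical data frames with identical column names and types
df1 <- data.frame(emp_id = 1:2, emp_name = c("A", "B"), stringsAsFactors = FALSE)
df2 <- data.frame(emp_id = 3:4, emp_name = c("C", "D"), stringsAsFactors = FALSE)

# rbind() appends rows; the columns of both data frames must match
all_emp <- rbind(df1, df2)

# cbind() appends columns; the number of rows must match
all_emp <- cbind(all_emp, salary = c(500, 620, 710, 480))

print(all_emp)
dim(all_emp)    # rows and columns after both operations
```

If the two data frames do not share the same columns, rbind() stops with an error, which is the situation where smartbind() from the gtools package is used instead.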
Using Loops
Exercise 1:
for(i in 1:10)
Exercise:2
for(i in 1:10)
Exercise:3
if((num %% 2) == 0) {
print(paste(num,"is Even"))
} else {
print(paste(num,"is Odd"))
}
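The exercises above combine loops with if and else. One possible sketch, assuming the range 1 to 6 and a helper vector called msgs that is not part of the exercises:

```r
# Classify the numbers 1 to 6 as even or odd
msgs <- character(0)              # collects the messages so they can be inspected
for (num in 1:6) {
  if ((num %% 2) == 0) {
    msg <- paste(num, "is Even")
  } else {
    msg <- paste(num, "is Odd")
  }
  print(msg)
  msgs <- c(msgs, msg)
}

# A while loop repeats until its condition becomes FALSE
total <- 0
i <- 1
while (i <= 10) {
  total <- total + i              # running sum of 1 to 10
  i <- i + 1
}
total
```

The for loop is the natural choice when the number of repetitions is known in advance; the while loop suits cases where the stopping point depends on a condition.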
barplot(max.temp)
barplot(max.temp,
ylab = "Day",
col = "darkred",
horiz = TRUE)
table(age)
barplot(table(age),
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10)
Histogram
Temperature <- airquality$Temp # assumed: the Temp column of the built-in airquality data set
hist(Temperature)
Histogram with added parameters
hist(Temperature,
xlim=c(50,100),
col="darkmagenta",
freq=FALSE)
h <- hist(Temperature)
h <- hist(Temperature,ylim=c(0,40))
hist(Temperature,
xlim=c(50,100),
col="chocolate",
border="brown",
breaks=c(55,60,70,75,80,100))
Box plot
str(airquality)
boxplot(airquality$Ozone,
ylab = "Ozone",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE)
b <- boxplot(airquality$Ozone)
boxplot(Temp~Month,
data=airquality,
ylab="Degree Fahrenheit",
col="orange",
border="brown")
Strip chart
str(airquality)
stripchart(airquality$Ozone)
stripchart(airquality$Ozone,
ylab="Ozone",
method="jitter",
col="orange",
pch=1)
# make a list
x <- list("temp"=temp, "norm"=tempNorm)
stripchart(x,
xlab="Degree Fahrenheit",
ylab="Temperature",
method="jitter",
col=c("orange","red"),
pch=16)
stripchart(Temp~Month,
data=airquality,
xlab="Months",
ylab="Temperature",
col="brown3",
group.names=c("May","June","July","August","September"),
vertical=TRUE,
pch=16)
TYPES OF CHARTS
Data set
class.interval frequency
11.5-16.5 2
16.5-21.5 6
21.5-26.5 7
26.5-31.5 5
31.5-36.5 3
hist(CHARTS1$frequency,right = FALSE)
histogram
v <- c(9,13,21,8,36,22,12,41,31,33,19)
hist(v,
breaks = 5)
plot
v <- c(7,12,28,3,41)
plot(v,type = "o")
line chart
v <- c(7,12,28,3,41)
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
stem(CHARTS1$frequency)
> stem(CHARTS1$frequency)
2 | 00
4 | 0
6 | 00
dotchart(CHARTS1$frequency)
SCATTER PLOT
plot(CHARTS1$frequency)
barplot(CHARTS1$frequency)
nrow=3,ncol=3,byrow=TRUE,
dimnames = list(c("A","B","C"),c("1947","1957","1967")))
colors <-c("darkblue","red","yellow")
xlab="Years", col=c("darkblue","red","yellow"),
x <- seq(-pi,pi,0.1)
plot(x, sin(x))
plot(x, sin(x),
ylab="sin(x)")
plot(x, sin(x),
ylab="sin(x)",
type="l",
col="blue")
plot(x, sin(x),
main="Overlaying Graphs",
ylab="",
type="l",
col="blue")
lines(x,cos(x), col="red")
legend("topleft",
c("sin(x)","cos(x)"),
fill=c("blue","red"))
barplot(max.temp, main="Barplot")
par(mfrow=c(2,2))
hist(Temperature)
boxplot(Temperature, horizontal=TRUE)
hist(Ozone)
boxplot(Ozone, horizontal=TRUE)
par(cex=0.7, mai=c(0.1,0.1,0.2,0.1))
par(fig=c(0.1,0.7,0.3,0.9))
hist(Temperature)
par(fig=c(0.8,1,0,1), new=TRUE)
boxplot(Temperature)
par(fig=c(0.1,0.67,0.1,0.25), new=TRUE)
stripchart(Temperature, method="jitter")
drawing a 3D plot
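# Reconstructed sketch: the function name cone and the grid below are assumptions
cone <- function(x, y) sqrt(x^2+y^2)
x <- y <- seq(-1, 1, length.out = 20)
z <- outer(x, y, cone)
persp(x, y, z)
persp(x, y, z,
zlab = "Height")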
mydata
output:
Anova
Analysis of Variance
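Anova code:
# Data recovered from the stem-and-leaf output below; the original order within each group is unknown
y = c(16.8, 17.6, 18.2, 18.8, 19.1, 19.7, 20.1, # group 1
15.9, 16.4, 17.4, 17.7, 18.4, 18.7, 19.1, # group 2
15.2, 15.9, 16.5, 16.7, 17.1, 17.7, 18.8) # group 3
n = rep(7, 3)
group = rep(1:3, n)
group
tmpfn = function(x) c(sum = sum(x), mean = mean(x), var = var(x),
n = length(x))
tapply(y, group, tmpfn)
data = data.frame(y = y, group = factor(group))
fit = lm(y ~ group, data)
anova(fit)
df = anova(fit)[, "Df"]
anova(fit)["Residuals", "Sum Sq"]/qchisq(c(0.025, 0.975), 18,
lower.tail = FALSE)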
output:
16 | 8
17 | 6
18 | 28
19 | 17
20 | 1
15 | 9
16 | 4
17 | 47
18 | 47
19 | 1
15 | 29
16 | 57
17 | 17
18 | 8
>
> tmpfn = function(x) c(sum = sum(x), mean = mean(x), var = var(x),
+ n = length(x))
> tapply(y, group, tmpfn)
$`1`
sum mean var n
130.300000 18.614286 1.358095 7.000000
$`2`
sum mean var n
123.600000 17.657143 1.409524 7.000000
$`3`
sum mean var n
117.900000 16.842857 1.392857 7.000000
>
> data = data.frame(y = y, group = factor(group))
> fit = lm(y ~ group, data)
> anova(fit)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
group 2 11.007 5.5033 3.9683 0.03735 *
Residuals 18 24.963 1.3868
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> df = anova(fit)[, "Df"]
> names(df) = c("trt", "err")
> df
trt err
2 18
>
> anova(fit)["Residuals", "Sum Sq"]
[1] 24.96286
>
> anova(fit)["Residuals", "Sum Sq"]/qchisq(c(0.025, 0.975), 18,
+ lower.tail = FALSE)
[1] 0.7918086 3.0328790
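Interpretation:
If the p-value from the F test is greater than or equal to 0.05, the null hypothesis (that all
group means are equal) is not rejected; otherwise it is rejected. In the output above the
p-value is 0.03735, which is below 0.05, so the null hypothesis is rejected.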
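The decision rule can be checked directly from the F statistic. The sketch below copies F value and degrees of freedom from the ANOVA table above; the variable names f_value, df_trt, df_err, alpha and decision are chosen here for illustration.

```r
# Values copied from the ANOVA table above
f_value <- 3.9683
df_trt  <- 2      # degrees of freedom for the groups
df_err  <- 18     # residual degrees of freedom

# The upper-tail probability of the F distribution is the p-value
p_value <- pf(f_value, df_trt, df_err, lower.tail = FALSE)
p_value

alpha <- 0.05     # significance level used in the interpretation above
if (p_value < alpha) {
  decision <- "reject the null hypothesis"
} else {
  decision <- "fail to reject the null hypothesis"
}
decision
```

The computed p-value should agree with the Pr(>F) column that anova() prints, up to rounding of the F value.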
Correlation :
cor(CORRELATION, use="complete.obs", method="pearson")
CORRELATION
X Y
1 10 20
2 12 13
3 9 12
4 13 5
5 6 9
6 8 2
7 12 5
8 13 6
OUTPUT:
>
cor(CORRELATION, use="complete.obs", method="spearman")
CORRELATION
X Y
1 10 20
2 12 13
3 9 12
4 13 5
5 6 9
6 8 2
7 12 5
8 13 6
output:
X Y
10 20
12 13
9 12
13 5
6 9
8 2
12 5
13 6
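Since the correlation output is not shown above, the following sketch rebuilds the CORRELATION data frame from the printed table and runs both calls. The object names pearson and spearman are assumptions, and the cor.test() call is an addition not present in the notes.

```r
# Data copied from the table above
CORRELATION <- data.frame(
  X = c(10, 12, 9, 13, 6, 8, 12, 13),
  Y = c(20, 13, 12, 5, 9, 2, 5, 6)
)

# Correlation matrices using the two methods named in the notes
pearson  <- cor(CORRELATION, use = "complete.obs", method = "pearson")
spearman <- cor(CORRELATION, use = "complete.obs", method = "spearman")
pearson
spearman

# A single coefficient together with a significance test
cor.test(CORRELATION$X, CORRELATION$Y, method = "pearson")
```

Each result is a 2 x 2 matrix with 1 on the diagonal; the off-diagonal entry is the correlation between X and Y.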
Regression
>
REGRESSION
alligator = data.frame(
lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
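alligator # view data
alligator_regression <- lm(lnWeight ~ lnLength, data = alligator) # fit the model shown in the Call below
alligator_regression
summary(alligator_regression)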
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
Coefficients:
(Intercept) lnLength
-8.476 3.431
>
> summary(alligator_regression)
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
Residuals: