R Programming
7) Write the purpose of the prod() and round() functions. Give an example for each.
*Ans: prod() Function
The prod() function calculates the product of all the elements in a numeric vector.
Example
# Create a numeric vector
x <- c(2, 3, 4)
product <- prod(x)
print(product) # Output: 24
round() Function
The round() function rounds a numeric value to the specified number of decimal places.
Example
# Create a numeric value
x <- 12.5678
rounded_x <- round(x, 2)
print(rounded_x) # Output: 12.57
10) What is ANOVA? Write the notations for Null and alternative hypothesis.
*Ans: ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or
more groups to determine if there is a significant difference between them. It is commonly used to
analyze the effect of one or more categorical variables (factors) on a continuous outcome variable.
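* - Null Hypothesis (H0): μ1 = μ2 = ... = μk
- This states that the means of all k groups are equal. With two groups A and B, it reduces to μA = μB.
- Alternative Hypothesis (H1): μi ≠ μj for at least one pair of groups
- This states that at least one group mean differs from the others. With two groups, it reduces to μA ≠ μB.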
11) Write the purpose and syntax of the lm() function in R.
*Ans: Purpose of lm() function:
The lm() function in R is used to fit a linear regression model to a dataset. It estimates the
relationship between a continuous outcome variable (response variable) and one or more predictor
variables (explanatory variables).
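Syntax of lm() function:
The general call can be sketched as below. This is a minimal illustration that assumes the built-in cars data set (columns speed and dist) as example data; the object name fit is only a placeholder.

```r
# General form: lm(formula, data, ...)
#   formula : response ~ predictor(s), e.g. y ~ x or y ~ x1 + x2
#   data    : data frame containing the variables named in the formula

# Fit stopping distance as a linear function of speed (built-in 'cars' data)
fit <- lm(dist ~ speed, data = cars)

print(coef(fit))   # estimated intercept and slope
summary(fit)       # standard errors, t-tests, R-squared
```

The formula is read as "dist is modeled by speed"; further predictors would be added on the right-hand side with +.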
Margins in a Plot
1. Margins: These are the areas outside the plotting region but within the figure region. There are
four margins:
- Bottom Margin (or x-axis margin): below the plotting region.
- Left Margin (or y-axis margin): to the left of the plotting region.
- Top Margin: above the plotting region.
- Right Margin: to the right of the plotting region.
Section-B
15) What is a file? Explain any four file handling functions in R.
*Ans: A file is a collection of data stored on a computer's storage device, such as a hard drive, solid-
state drive, or flash drive. Files can contain various types of data, including text, images, audio,
video, and executable programs. Each file has a unique name and is stored in a specific location on
the computer, such as a folder or directory.
File Handling Functions in R
Here are four commonly used file handling functions in R:
1. read.table(): Reads a file in table format and returns a data frame.
Example: data <- read.table("data.txt", header = TRUE)
2. write.table(): Writes a data frame to a file in table format.
Example: write.table(data, "data.txt", row.names = FALSE)
3. file.exists(): Checks if a file exists.
Example: file.exists("data.txt")
4. file.remove(): Deletes a file.
Example: file.remove("data.txt")
16) Compute the mean and median for the following observations: (9,5,2,3,4,6.7). Mention the R
functions for the same.
*Ans: To compute the mean and median for the given observations, we can use the following R
functions:
Mean
The R function to calculate the mean is mean().
Median
The R function to calculate the median is median().
Here's how to use these functions:
# Given observations
x <- c(9, 5, 2, 3, 4, 6.7)
# Calculate mean
mean_x <- mean(x)
print(paste("Mean:", mean_x))
# Calculate median
median_x <- median(x)
print(paste("Median:", median_x))
When you run this code, it reports a mean of 4.95 (the sum 29.7 divided by 6) and a median of 4.5 (the
average of the two middle values, 4 and 5, of the sorted observations).
17) Write a note on simple linear regression.
*Ans: Simple linear regression is a statistical method used to model the relationship between a
dependent variable (y) and a single independent variable (x). The goal is to create a linear equation
that best predicts the value of y based on the value of x.
Assumptions:
1. Linearity: The relationship between x and y is linear.
2. Independence: Each observation is independent of the others.
3. Homoscedasticity: The variance of the residuals is constant across all levels of x.
4. Normality: The residuals are normally distributed.
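5. No multicollinearity: This applies only to multiple regression, where the predictors must not be
highly correlated with each other. With a single independent variable it is satisfied automatically.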
Equation:
The simple linear regression equation is:
y = β0 + β1x + ε
where:
- y is the dependent variable
- x is the independent variable
- β0 is the intercept or constant term
- β1 is the slope coefficient
- ε is the error term
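The coefficients β0 and β1 are usually estimated by least squares. The sketch below uses made-up data (the values of x and y are assumptions chosen for illustration) and checks that the closed-form estimates, slope = cov(x, y) / var(x) and intercept = mean(y) − slope × mean(x), agree with what lm() reports.

```r
# Hypothetical data: y is roughly 2 + 3x with small deviations
x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.0)

# Least-squares estimates from the closed-form formulas
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# The same estimates obtained from lm()
fit <- lm(y ~ x)
print(c(b0, b1))
print(coef(fit))
```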
Section-C
18) a. Discuss the features of R programming.
Ans: *Cross-Platform Compatibility:
R is available on multiple platforms, including Windows, macOS, and Linux.
*Large Community and Support:
R has a large and active community of users, developers, and contributors. Users can seek help and
support through various online forums, mailing lists, and documentation.
*Extensive Documentation and Resources:
R has extensive documentation and resources, including the official R documentation, online
tutorials, books, and courses.
*Integration with Other Tools and Languages:
R can be easily integrated with other tools and languages, including Python, SQL, and Excel.
*Open-Source and Free:
R is open-source and free, making it accessible to users worldwide.
R Programming Structures: Control Statements - Loops, Looping Over Nonvector Sets, If-Else,
Arithmetic and Boolean Operators and Values, Default Values for Arguments, Return Values -
Deciding Whether to Explicitly Call return(), Returning Complex Objects, Functions Are Objects, No
Pointers in R, Recursion - A Quicksort Implementation, Extended Example: A Binary
Search Tree.
Control Statements: The statements in an R program are executed sequentially from the top of the
program to the bottom. But some statements need to be executed repeatedly, while others should be
executed only if certain conditions are met. R has the standard control structures for this.
Loops: Looping constructs repeatedly execute a statement or series of statements until a condition is
no longer true. These include the for, while and repeat structures, with the additional clauses break and next.
1) FOR :- The for loop executes a statement once for each value
contained in a sequence.
The syntax is for (var in sequence)
{
statement
}
Here, sequence is a vector and var takes on each of its values
in turn during the loop. In each iteration, statement is evaluated.
for (n in x) { - - - }
It means that there will be one iteration of the loop for each
component of the vector x, with n taking on the values of those
components: in the first iteration, n = x[1]; in the second
iteration, n = x[2]; and so on.
In this example, for (i in 1:10) print("Hello"), the word Hello is
printed 10 times.
Square of every element in a vector:
> x <- c(5,12,13)
> for (n in x) print(n^2)
[1] 25
[1] 144
[1] 169
WHILE :- A while loop first evaluates its test expression. If the expression is TRUE, the statements
inside the loop are executed and the flow returns to evaluate the expression again.
This is repeated each time until the expression evaluates to FALSE, in which case the loop exits.
Example
> i <- 1
> while (i<=10) i <- i+4
>i
[1] 13
5) Repeat:- The repeat loop is used to execute a block of code multiple times. There is no
built-in condition check in a repeat loop to exit the loop.
STATISTICS WITH R PROGRAMMING Unit - 2
We must put an exit condition explicitly inside the body of the loop and use the break statement to
exit the loop. Failing to do so will result in an infinite loop.
Syntax:
repeat
{
statement
}
Example:
x <- 1
repeat
{
print(x)
x = x+1
if (x == 6)
break
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Looping Over Non-vector Sets:- R does not directly support iteration over nonvector sets, but there
are a couple of indirect yet easy ways to accomplish it:
apply( ):- Applies a function over the rows or columns of a 2D array (matrix) or data frame,
for example to compute aggregates like sum, mean, median or standard deviation.
syntax:- apply(matrix, margin, fun, ...)
margin = 1 indicates rows
margin = 2 indicates columns
> x <- matrix(1:20,nrow = 4,ncol=5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
> apply(x,2,sum)
[1] 10 26 42 58 74
Use lapply( ), assuming that the iterations of the loop are independent of each other, thus
allowing them to be performed in any order. lapply( ) can be applied to data frames, lists
and vectors, and returns a list of the same length as X, each element of
which is the result of applying FUN to the corresponding element of X.
Syntax: lapply(X, FUN, ...)
> x <- matrix(1:4,2,2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
The lapply( ) function is applied to every element of the object.
> lapply(x,sqrt)
[[1]]
[1] 1
[[2]]
[1] 1.414214
[[3]]
[1] 1.732051
[[4]]
[1] 2
Use get( ): As its name implies, this function takes as an argument a character string
representing the name of some object and returns the object of that name. It sounds simple,
but get() is a very powerful function.
Syntax: get("character string")
> get("sum")
function (..., na.rm = FALSE) .Primitive("sum")
> get("g")
function(x)
{
return(x+1)
}
> get("num")
[1] "45"
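To see how get() helps with looping over nonvector sets, the sketch below loops over the names of two matrices and fetches each one inside the loop. The matrices u and v and the result vector totals are hypothetical names introduced only for this illustration.

```r
# Two hypothetical matrices we want to process in one loop
u <- matrix(1:6, nrow = 3)
v <- matrix(c(10, 20, 30, 40), nrow = 2)

# Loop over the *names*, then fetch each object with get()
totals <- numeric(0)
for (m in c("u", "v")) {
  z <- get(m)            # z is now the matrix whose name is stored in m
  totals[m] <- sum(z)    # store the result under that name
}
print(totals)
```

A for loop cannot iterate over the matrices directly as a vector, but it can iterate over a character vector of their names.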
Note:- Reserved words in R programming are a set of words that have special meaning and cannot be
used as an identifier (variable name, function name etc.). This list can be viewed by
typing help(reserved) or ?reserved at the R command prompt.
Reserved words in R
If –Else:- The if-else control structure executes a statement if a given condition is true. Optionally, a
different statement is executed if the condition is false.
The syntax is
if (cond)
{
statements
}
or, with an else branch,
if (cond)
{
statement1
} else
{
statement2
}
x <- 8
if(x>3) {y <- 10
} else {y<-0}
print(y)
Output:
[1] 10
An if-else statement works like a function call, and as such, it returns the value of the last expression evaluated.
v <- if (cond) expression1 else expression2
This will set v to the result of expression1 or expression2, depending on whether cond is true. You
can use this fact to compact your code. Here's a simple example:
> x <- 2
> y <- if(x == 2) x else x+1
> y
[1] 2
This is equivalent to the longer form:
> x <- 2
> if(x == 2) y <- x else y <- x+1
> y
[1] 2
1. Arithmetic operators:- These operators are used to carry out mathematical operations like addition
and multiplication. Here is a list of arithmetic operators available in R. All examples use the vectors
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
Operator   Description                                          Example         Result
+          Adds two vectors                                     print(v+t)      [1] 10.0  8.5 10.0
-          Subtracts the second vector from the first           print(v-t)      [1] -6.0  2.5  2.0
*          Multiplies both vectors                              print(v*t)      [1] 16.0 16.5 24.0
/          Divides the first vector by the second               print(v/t)      [1] 0.250000 1.833333 1.500000
%%         Remainder of the first vector divided by the second  print(v%%t)     [1] 2.0 2.5 2.0
%/%        Integer quotient of the first vector by the second   print(v%/%t)    [1] 0 1 1
^          First vector raised to the exponent of the second    print(v^t)      [1]  256.000  166.375 1296.000
2. Relational Operators:- Relational operators are used to compare values. Each element of the
first vector is compared with the corresponding element of the second vector. The result of each
comparison is a Boolean value. All examples use the vectors
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
Operator   Checks if each element of the first vector is ...                      Example        Result
>          greater than the corresponding element of the second vector            print(v>t)     [1] FALSE  TRUE FALSE FALSE
<          less than the corresponding element of the second vector               print(v < t)   [1]  TRUE FALSE  TRUE FALSE
==         equal to the corresponding element of the second vector                print(v == t)  [1] FALSE FALSE FALSE  TRUE
<=         less than or equal to the corresponding element of the second vector   print(v<=t)    [1]  TRUE FALSE  TRUE  TRUE
>=         greater than or equal to the corresponding element of the second vector  print(v>=t)  [1] FALSE  TRUE FALSE  TRUE
!=         unequal to the corresponding element of the second vector              print(v!=t)    [1]  TRUE  TRUE  TRUE FALSE
3) Logical Operators:- These are applicable only to vectors of type logical, numeric or complex. Zero is
considered FALSE and non-zero numbers are taken as TRUE. Each element of the first vector is
combined with the corresponding element of the second vector. The result is a Boolean
value. The examples use the vectors
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
Operator   Description                                                              Example      Result
|          Element-wise Logical OR: gives TRUE if at least one of the two elements is TRUE   print(v|t)   [1]  TRUE FALSE  TRUE  TRUE
!          Logical NOT: gives the opposite logical value of each element            print(!v)    [1] FALSE  TRUE FALSE FALSE
The logical operators && and || work on single values and give a single logical value as output.
Older versions of R silently used only the first element of longer vectors; since R 4.3.0, supplying
a vector of length greater than one is an error, so the element is selected explicitly here.
Operator Description Example
&&   Called Logical AND operator. Takes the first element of both vectors and gives TRUE only if both are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
it produces the following result −
[1] TRUE
4) Assignment Operators:- These operators are used to assign values to variables.
Operator Description Example
<- or = or <<-   Called Left Assignment
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
-> or ->>   Called Right Assignment
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
5) Miscellaneous Operators:- These operators are used for specific purposes and not for general
mathematical or logical computation.
Operator Description Example
%in%   This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
Functions are assigned to objects just like any other variable, using the <- operator.
The function keyword is followed by a set of parentheses that can either be empty (no arguments) or
contain any number of arguments.
The body of the function is enclosed in curly braces ({ and }). This is not necessary if the
function body is a single expression.
A semicolon (;) can be used to indicate the end of a statement but is not necessary.
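# Example to count the odd numbers with an explicit return statement
oddcount <- function(x) {
   k <- 0
   for (n in x) {
      if (n %% 2 == 1) k <- k+1
   }
   return(k)
}
> oddcount(c(1,3,5))
[1] 3
> oddcount(c(1,2,3,7,9))
[1] 4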
Variables created outside functions are global and are available within functions as well. Example:
> f <- function(x) return(x+y)
> y <- 3
> f(5)
[1] 8
Here y is a global variable. A global variable can be written to from within a function by using R's
superassignment operator, <<-.
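The difference between ordinary assignment and superassignment inside a function can be sketched as follows. The variable counter and the two helper functions are hypothetical names used only for this illustration.

```r
counter <- 0                    # variable defined outside the functions

bump_local <- function() {
  counter <- counter + 1        # creates a local copy; the outer variable is unchanged
  invisible(counter)
}

bump_global <- function() {
  counter <<- counter + 1       # writes to the outer 'counter'
  invisible(counter)
}

bump_local()
print(counter)    # still 0
bump_global()
print(counter)    # now 1
```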
Default Arguments:- R also makes frequent use of default arguments. Consider a function definition
like this:
> g <- function(x,y=2,z=T) { ... }
Here y will be initialized to 2 if the programmer does not specify y in the call. Similarly, z will have the
default value TRUE. Now consider this call:
> g(12,z=FALSE)
Here, the value 12 is the actual argument for x, and we accept the default value of 2 for y, but we
override the default for z, setting its value to FALSE.
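A runnable version of such a function might look like the sketch below. The body of g (scale x by y, optionally rounding) is an assumption made up for illustration; only the argument list follows the text above.

```r
# Hypothetical function: scale x by y, optionally rounding the result
g <- function(x, y = 2, z = TRUE) {
  result <- x * y
  if (z) result <- round(result)
  result
}

g(12.3)              # y = 2 and z = TRUE by default
g(12.3, z = FALSE)   # y keeps its default, z is overridden by name
g(12.3, 3)           # y is supplied by position, z keeps its default
```

Naming the argument (z = FALSE) lets the caller skip y without having to supply it.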
Lazy Evaluation of Function:- Arguments to functions are evaluated lazily, which means they are
evaluated only when needed by the function body.
# Create a function with arguments.
new.function <- function(a, b) {
print(a^2)
print(a)
print(b)
}
# Evaluate the function without supplying one of the arguments.
new.function(6)
When we execute the above code, it produces the following result −
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
Return Values:- Functions are generally used for computing some value, so they need a mechanism to
supply that value back to the caller. This is called returning.
The return value of a function can be any R object. Although the return value is often a list, it
could even be another function.
You can transmit a value back to the caller by explicitly calling return(). Without this call, the
value of the last executed statement will be returned by default.
If the last statement in the function is a for( ) statement, the function returns the value NULL.
# First build it without an explicit return
num <- function(x)
{
x*2
}
> num(5)
[1] 10
# Now build it with an explicit return
num <- function(x)
{
return(x*2)
}
> num(5)
[1] 10
# build it again, this time with more statements after the explicit return
num <- function(x)
{
return(x*2)
# nothing below here is executed, because return() has already exited the function
print("VISHNU")
return(17)
}
> num(5)
[1] 10
# if the last statement is a for loop, or the body is empty, the function returns NULL
num <- function(x)
{ }
> num(5)
NULL
Deciding Whether to Explicitly Call return():- The R idiom is to avoid explicit calls to return(). One of
the reasons cited for this approach is that calling that function lengthens execution time. However,
unless the function is very short, the time saved is negligible, so this might not be the most compelling
reason to refrain from using return(). But it usually isn't needed.
#Example to count the odd numbers with no return statement
oddcount <- function(x) {
   k <- 0
   for (n in x) {
      if (n %% 2 == 1) k <- k+1
   }
   k
}
> oddcount(c(12,2,5,9,7))
[1] 3
Both programs produce the same output, with and without return.
Good software design, however, means being able to glance through a function's code and immediately
spot the various points at which control is returned to the caller. The easiest way to accomplish this
is to use an explicit return() call in all lines in the middle of the code that cause a return.
Returning Complex Objects:- Since the return value can be any R object, you can return complex objects.
Here is an example of a function being returned:
g<-function() {
x<- 3
t <- function(x) return(x^2)
return(t)
}
> g()
function(x) return(x^2)
<environment: 0x16779d58>
If your function has multiple return values, place them in a list or other container.
Functions Are Objects:- R functions are first-class objects (of the class "function"), meaning that they
can be used for the most part just like other objects. This is seen in the syntax of function creation:
g <- function(x)
{
return(x+1)
}
function( ) is a built-in R function whose job is to create functions. On the right-hand side, there are really
two arguments to function(): the first is the formal argument list for the function, i.e., x, and the second is
the body of that function, return(x+1). That second argument must be of class "expression". So, the point
is that the right-hand side creates a function object, which is then assigned to g.
Even { is a function, as you can verify by typing
> ?"{"
Its job is to make a single unit of what could be several statements.
formals( ) :- Get or set the formal arguments of a function
> formals(g) # g is a function with formal argument "x"
$x
body( ) :- Get or set the body of a function
> body(g) # g is a function
{
    return(x + 1)
}
Replacing the body of a function: quote( ) is used to supply the new body as an unevaluated expression
> g <- function(x) return(x+1)
> body(g) <- quote(2*x+3)
> g
function (x)
2 * x + 3
Typing the name of an object prints that object to the screen, and functions are no exception:
>g
function(x)
{
return(x+1)
}
Printing out a function is also useful if you are not quite sure what an R library function
does. Code of a function is displayed by typing the built-in function name.
> sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x17b49740>
<environment: namespace:stats>
Some of R's most fundamental built-in functions are written directly in C, and thus they
are not viewable in this manner.
> sum
function (..., na.rm = FALSE) .Primitive("sum")
Since functions are objects, you can also assign them, use them as arguments to other
functions, and so on.
> f1 <- function(a,b) return(a+b)
> f2 <- function(a,b) return(a-b)
> f <- f1 # Assigning function object to other object
> f(3,2)
[1] 5
> g <- function(h,a,b) h(a,b) # passing a function object as an argument
> g(f1,3,2)
[1] 5
> g(f2,3,2)
[1] 1
No Pointers in R:- R does not have variables corresponding to pointers or references like the C language.
This can make programming more difficult in some cases. For example, you cannot write a function that
directly changes its arguments.
One workaround is to create a class constructor and have every instance of the
class be its own environment. One can then pass the object into a function and it will effectively be
passed by reference instead of by value, because unlike other R objects, environments are not copied
when passed to functions. Changes to the object in the function will change the object in the calling
frame. In this way, one can operate on the object and change internal elements without having to create
a copy of the object when the function is called, nor pass the entire object back from the function. For
large objects, this saves memory and time.
If a function has several outputs, a solution is to gather them together into a list, call the function
with this list as an argument, have the function return the list, and then reassign to the original list.
An example is the following function, which determines the indices of odd and even numbers in a
vector of integers:
> y <-function(v){
odds <- which(v %% 2 == 1)
evens <- which(v %% 2 == 0)
list(o=odds,e=evens)
}
> y(c(2,34,1,5))
$o
[1] 3 4
$e
[1] 1 2
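The environment-based approach described above can be sketched as follows. The names new_counter, increment, cnt and lst are hypothetical and serve only to contrast an environment (modified in place) with an ordinary list (copied on modification).

```r
# A hypothetical "counter" object stored in an environment
new_counter <- function() {
  obj <- new.env()
  obj$count <- 0
  obj
}

# Modifies the environment in place; nothing needs to be returned
increment <- function(obj) {
  obj$count <- obj$count + 1
  invisible(NULL)
}

cnt <- new_counter()
increment(cnt)
increment(cnt)
print(cnt$count)    # the caller sees both increments

# Contrast: an ordinary list is copied, so the caller's list is unchanged
lst <- list(count = 0)
increment_list <- function(l) {
  l$count <- l$count + 1
  invisible(NULL)
}
increment_list(lst)
print(lst$count)    # still the original value
```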
Recursion:- Recursion is a programming technique in which, a function calls itself repeatedly for some
input.
Example: the call r = fact(4) unwinds as follows.
fact(4) returns 4 * fact(3) = 24
fact(3) returns 3 * fact(2) = 6
fact(2) returns 2 * fact(1) = 2
fact(1) returns 1
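A function producing this chain of calls is not shown in the notes; a minimal sketch, assuming the name fact and a base case of n <= 1, could be:

```r
fact <- function(n) {
  if (n <= 1) return(1)       # base case stops the recursion
  return(n * fact(n - 1))     # recursive case: n! = n * (n-1)!
}

r <- fact(4)
print(r)
```

Each call waits for the result of the call below it, so the multiplications are carried out on the way back up from fact(1).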
A Quicksort Implementation:- Quicksort is also known as partition-exchange sort and is based on the
divide-and-conquer algorithm design method. It was proposed by C.A.R. Hoare. The basic idea of
quicksort is simple. We pick one element at a time as the pivot element and move it to the
final position that it should occupy in the sorted list. While identifying this position,
we arrange the elements so that the elements to the left of the pivot are less than the pivot
and the elements to the right of the pivot are greater than or equal to it, thereby
dividing the list into 2 parts. We then apply quicksort to these 2 parts recursively until the entire list is
sorted.
For instance, suppose we wish to sort the vector (5,4,12,13,3,8,88). We first compare everything to the
first element, 5, to form two subvectors: one consisting of the elements less than 5 and the other
consisting of the elements greater than or equal to 5. That gives us subvectors (4,3) and (12,13,8,88). We
then call the function on the subvectors, returning (3,4) and (8,12,13,88). We string those together with
the 5, yielding (3,4,5,8,12,13,88), as desired. R's vector-filtering capability and its c() function make
implementation of Quicksort quite easy.
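The procedure just described can be sketched with vector filtering and c(). The function name qs and its internal variable names are assumptions for illustration; the first element is used as the pivot, as in the worked example.

```r
qs <- function(x) {
  if (length(x) <= 1) return(x)           # nothing left to sort
  pivot <- x[1]
  therest <- x[-1]
  sv1 <- therest[therest < pivot]         # elements smaller than the pivot
  sv2 <- therest[therest >= pivot]        # elements greater than or equal to it
  c(qs(sv1), pivot, qs(sv2))              # sort each part, then string together
}

print(qs(c(5, 4, 12, 13, 3, 8, 88)))
```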
Binary search tree:- The nature of binary search trees implies that at any node, all of the elements in
the node's left subtree are less than or equal to the value stored in this node, while the right subtree
stores the elements that are larger than the value in this node. In our example tree, where the root
node contains 8, all of the values in the left subtree (5, 2 and 6) are less than 8, while 20 is greater than 8.
The code follows. Note that it includes only routines to insert new items and to traverse the tree.
# storage is in a matrix, say m, one row per node of the tree; a link i in the tree means the vector
#m[i,] = (u,v,w); u and v are the left and right links, and w is the stored value; null links have the value
#NA; the matrix is referred to as the list (m,nxt,inc), where m is the matrix, nxt is the next empty row to
#be used, and inc is the number of rows of expansion to be allocated when the matrix becomes full
# inserts newval into nonempty tree whose head is index hdidx in the storage space treeloc; note that
#return value must be reassigned to tree; inc is as in newtree() above
ins <- function(hdidx,tree,newval,inc) {
   tr <- tree
   # check for room to add a new element
   tr$nxt <- tr$nxt + 1
   if (tr$nxt > nrow(tr$mat))
      tr$mat <- rbind(tr$mat,matrix(rep(NA,inc*3),nrow=inc,ncol=3))
   newidx <- tr$nxt # where we'll put the new tree node
   tr$mat[newidx,3] <- newval
   idx <- hdidx # marks our current place in the tree
   node <- tr$mat[idx,]
   nodeval <- node[3]
   while (TRUE) {
      # which direction to descend, left or right?
      if (newval <= nodeval) dir <- 1 else dir <- 2
      # descend
      # null link?
      if (is.na(node[dir])) {
         tr$mat[idx,dir] <- newidx
         break
      } else {
         idx <- node[dir]
         node <- tr$mat[idx,]
         nodeval <- node[3]
      }
   }
   return(tr)
}
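The traversal routine mentioned above is not reproduced in these notes. As a rough sketch of how a traversal works on this matrix storage, the block below builds the example tree (root 8) by hand and walks it in order. The function name inorder and the hand-built matrix m are assumptions for illustration, not the original routine.

```r
# Hand-built storage matrix for the example tree:
#        8
#      /   \
#     5     20
#    / \
#   2   6
# Columns: left link, right link, stored value (NA = null link)
m <- rbind(c(2, 3, 8),      # row 1: root 8, left -> row 2, right -> row 3
           c(4, 5, 5),      # row 2: node 5, left -> row 4, right -> row 5
           c(NA, NA, 20),   # row 3: leaf 20
           c(NA, NA, 2),    # row 4: leaf 2
           c(NA, NA, 6))    # row 5: leaf 6
tree <- list(mat = m)

# In-order traversal: left subtree, then the node itself, then the right subtree
inorder <- function(hdidx, tree) {
  node <- tree$mat[hdidx, ]
  left  <- if (!is.na(node[1])) inorder(node[1], tree) else numeric(0)
  right <- if (!is.na(node[2])) inorder(node[2], tree) else numeric(0)
  c(left, node[3], right)
}

print(inorder(1, tree))
```

Because smaller values always sit in the left subtree, an in-order traversal visits the stored values in sorted order.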
sapply( ):- sapply() is a wrapper function for lapply(); the difference is that it returns a vector or
matrix instead of a list object where possible.
Syntax: sapply(X, FUN, ...)
# create a list with 2 elements
x <- list(a=1:10,b=11:20)
# mean of values using sapply
sapply(x, mean)
   a    b
 5.5 15.5
tapply( ):- tapply() applies a function or operation to subsets of a vector, broken down by a given factor
variable.
To understand this, imagine we have the ages of 20 people (males and females), and we need to know the
average age of males and of females in this sample. We first group the ages by gender
(say, the ages of 12 males and the ages of 8 females), and then calculate the average age within
each group.
Syntax of tapply: tapply(X, INDEX, FUN, ...)
X = a vector, INDEX = a factor or list of one or more factors, FUN = function or operation that needs
to be applied, ... = optional arguments for the function
> ages <- c(25,26,55,37,21,42)
> affils <- c("R","D","D","R","U","D")
> tapply(ages,affils,mean)
 D  R  U
41 31 21
STATISTICS WITH R PROGRAMMING Unit - 3
Doing Math and Simulation in R:- Math Functions, Extended Example: Calculating a Probability, Cumulative
Sums and Products, Minima and Maxima, Calculus, Functions for Statistical Distributions, Sorting, Linear
Algebra Operations on Vectors and Matrices, Extended Example: Vector Cross Product, Extended Example:
Finding Stationary Distributions of Markov Chains, Set Operations, Input/Output, Accessing the Keyboard and
Monitor, Reading and Writing Files
Math Function : R contains built-in functions for various math operations and for
statistical distributions.
[1] 1 3 6
cumprod( ): Cumulative product of the elements of a vector
> z <- c(2,5,3)
> cumprod(z)
[1] 2 10 30
round( ): Rounds to the closest integer, or to the number of decimal places given by digits
> round(12.4)
[1] 12
> round(2.43,digits=1)
[1] 2.4
floor( ): Rounds down to the closest integer below
> floor(12.4)
[1] 12
ceiling( ): Rounds up to the closest integer above
> ceiling(12.4)
[1] 13
factorial( ): Factorial function
> factorial(5)
[1] 120
sin(), cos(), tan() and so on: Trigonometric functions; the arguments are in radians. asin(), acos() and
atan() are the inverse trigonometric functions.
> tan(45*pi/180)
[1] 1
> a<-tan(45*pi/180)
> b<-atan(a)
>b
[1] 0.7853982
> b*180/pi
[1] 45
sum(): sum returns the sum of all the values present in its arguments.
sum(..., na.rm = FALSE)
... : numeric or complex or logical vectors.
na.rm : logical. Should missing values (including NaN) be removed?
Example 1:-
> x
[1] 4 -5 56
> sum(x)
[1] 55
Example 2:-
> y <- c(2,3,NA,1)
> sum(y)
[1] NA
> sum(y, na.rm=TRUE)
[1] 6
prod(): prod returns the product of all the values present in its arguments.
prod(..., na.rm = FALSE)
... : numeric or complex or logical vectors.
na.rm : logical. Should missing values (including NaN) be removed?
Example 1:-
> x <- c(1,3,5)
> prod(x)
[1] 15
Example 2:-
> y
[1] 2 3 NA 1
> prod(y)
[1] NA
> prod(y, na.rm=TRUE)
[1] 6
The probability can be calculated using the prod() function. Let us assume that there are n independent
events, with the ith event having probability p_i of occurring.
What is the probability of exactly one of these events occurring?
Considering an example where the value of n is 3. The events are named A, B, and C. Then we
break down the computation as follows:
P(exactly one event occurs) = P(A and not B and not C) +
P(not A and B and not C) +
P(not A and not B and C)
P(A and not B and not C) would be pA (1 − pB) (1 − pC), and so on.
For general n, that is calculated as follows:
$$\sum_{i=1}^{n} p_i (1 - p_1) \cdots (1 - p_{i-1}) (1 - p_{i+1}) \cdots (1 - p_n)$$
(The ith term inside the sum is the probability that event i occurs and all the others do not occur.)
Here's code to compute this, with our probabilities p_i contained in the vector p:
exactlyone <- function(p) {
notp <- 1 - p
tot <- 0.0
for (i in 1:length(p))
tot <- tot + p[i] * prod(notp[-i])
return(tot)
}
notp <- 1 - p :- creates a vector of all the "not occur" probabilities 1 − p_j, using recycling.
The expression notp[-i] selects all the elements of notp except the ith, so prod(notp[-i]) computes the
product of those elements.
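As a check on the function, the sketch below evaluates it for three events and compares the result with the three-term breakdown written out by hand. The probabilities 0.2, 0.5 and 0.3 are made-up values; the function is repeated so that the block runs on its own.

```r
exactlyone <- function(p) {
  notp <- 1 - p
  tot <- 0.0
  for (i in seq_along(p))
    tot <- tot + p[i] * prod(notp[-i])
  tot
}

# Three hypothetical independent events A, B, C
p <- c(0.2, 0.5, 0.3)

# pA(1-pB)(1-pC) + (1-pA)pB(1-pC) + (1-pA)(1-pB)pC
by_hand <- 0.2 * 0.5 * 0.7 + 0.8 * 0.5 * 0.7 + 0.8 * 0.5 * 0.3

print(exactlyone(p))
print(by_hand)
```

Both values come to 0.07 + 0.28 + 0.12 = 0.47, up to floating-point rounding.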
[1] 4
> min(c(12,4,6,NA,34),na.rm=FALSE)
[1] NA
> x <- c(2,-4,6,-34)
> max(x[2],x[3])
[1] 6
which.min() and which.max(): Index of the minimal element and maximal element of a vector.
>x <- c(1,4,-423,8,-2,23)
> which.min(x)
[1] 3
> which.max(x)
[1] 6
Function minimization/maximization can be done via nlm() and optim(). For example, let's find the
smallest value of f(x) = x^2 − sin(x).
> nlm(function(x) return(x^2-sin(x)),8)
$minimum
[1] -0.2324656
$estimate
[1] 0.4501831
$gradient
[1] 4.024558e-09
$code
[1] 1
$iterations
[1] 5
Here, the minimum value was found to be approximately −0.23, occurring at x = 0.45. A Newton-
Raphson method (a technique from numerical analysis for approximating roots) is used, running five
iterations in this case. The second argument specifies the initial guess, which we set to be 8.
Calculus:- R also has some calculus capabilities, including symbolic differentiation and numerical
integration.
> D(expression(exp(x^2)),"x") # derivative
exp(x^2) * (2 * x)
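Numerical integration is done with integrate(). The short sketch below integrates x^2 over [0, 1], whose exact value is 1/3, and also evaluates the symbolic derivative from D() at a point; the names res and dx are placeholders.

```r
# Numerical integration of x^2 over [0, 1]; the exact value is 1/3
res <- integrate(function(x) x^2, lower = 0, upper = 1)
print(res$value)

# Symbolic derivative from D(), evaluated numerically at x = 1
dx <- D(expression(exp(x^2)), "x")
print(eval(dx, list(x = 1)))   # equals 2 * exp(1)
```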
Functions for Statistical Distributions:- R has functions available for most of the well-known statistical
distributions.
Prefix the name as follows:
• With d for the density or probability mass function (pmf)
• With p for the cumulative distribution function (cdf)
• With q for quantiles
• With r for random number generation
The rest of the name indicates the distribution. Table 8-1 lists some common statistical distribution
functions.
Distribution   Density/pmf   cdf        Quantiles   Random numbers
Normal         dnorm()       pnorm()    qnorm()     rnorm()
Chi square     dchisq()      pchisq()   qchisq()    rchisq()
Binomial       dbinom()      pbinom()   qbinom()    rbinom()
As an example, simulate 1,000 chi-square variates with 2 degrees of freedom and find their mean.
> mean(rchisq(1000,df=2))
[1] 1.938179
The r in rchisq specifies that we wish to generate random numbers— in this case, from the chi-square
distribution. As seen in this example, the first argument in the r-series functions is the number of random
variates to generate.
These functions also have arguments specific to the given distribution families. In our example, we
use the df argument for the chi-square family, indicating the number of degrees of freedom.
Let's also compute the 95th percentile of the chi-square distribution with two degrees of freedom:
> qchisq(0.95,2)
[1] 5.991465
Here, we used q to indicate quantile, in this case the 0.95 quantile, or the 95th percentile. The first
argument in the d, p, and q series is actually a vector, so that we can evaluate the density/pmf, cdf, or
quantile function at multiple points. Let's find both the 50th and 95th percentiles of the chi-square
distribution with 2 degrees of freedom.
> qchisq(c(0.5,0.95),df=2)
[1] 1.386294 5.991465
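The four prefixes can be tied together in one short sketch. The evaluation points and the seed are arbitrary choices made for illustration; they are not taken from the notes.

```r
# qchisq() and pchisq() are inverses of each other for the same df
q95 <- qchisq(0.95, df = 2)
p_back <- pchisq(q95, df = 2)          # recovers 0.95

# The d-, p- and q-functions accept a vector of evaluation points
dens <- dchisq(c(1, 2, 3), df = 2)

# The r-function simulates; the sample mean should be close to df = 2
set.seed(1)                            # arbitrary seed, only for repeatability
sim_mean <- mean(rchisq(1000, df = 2))

print(p_back)
print(dens)
```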
Sorting:- Sorting arranges the elements of a vector in ascending or descending order. The sort() function returns a sorted copy:
> x <- c(12,4,25,4)
> sort(x)
[1] 4 4 12 25
> x
[1] 12 4 25 4
As the last line shows, the vector x itself did not change. To obtain the indices of the elements in
sorted order, use the order() function:
> order(x)
[1] 2 4 1 3
The output shows that x[2] is the smallest value in x, x[4] is the second smallest, x[1] is the third smallest,
and so on. order() can also be combined with indexing to sort data frames, and it works on character
vectors as well as numeric ones.
The rank() function reports the rank of every element of a vector:
> x
[1] 12 4 25 4
> rank(x)
[1] 3.0 1.5 4.0 1.5
The output shows that the value 12 has rank 3, which means that it is the 3rd smallest element
of x. The value 4 appears twice in x, so the two occurrences share ranks 1 and 2 and each is given the
average rank 1.5.
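The relationship between sort(), order() and rank() can be sketched as follows. The decreasing and ties.method arguments are standard base R options that the notes do not discuss; they are shown here as optional extras.

```r
x <- c(12, 4, 25, 4)

# Indexing with order() reproduces sort()
sorted_via_order <- x[order(x)]

# Descending order
descending <- sort(x, decreasing = TRUE)

# Ties can take the smallest rank instead of the average rank
min_ranks <- rank(x, ties.method = "min")

print(sorted_via_order)
print(descending)
print(min_ranks)
```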
Example:- using order function on a dataframe.
> age <- c(12,4,34,14)
> names <- c("A","B","C","D")
> df <- data.frame(age,names)
> df
age names
1 12 A
2 4 B
3 34 C
4 14 D
> df[order(df$age),]
age names
2 4 B
1 12 A
4 14 D
3 34 C
Linear Algebra Operation on Vectors and Matrices:- The vector quantity can be multiplied to a scalar
quantity as demonstrated:
> x <- c(13,5,12,5)
> y <-2*x
>y
[1] 26 10 24 10
To compute the inner product (or dot product) of two vectors, use crossprod(),
> a<-c(3,7,2)
> b<-c(2,5,8)
> crossprod(a,b)
[,1]
[1,] 57
The function computed 3·2+7·5+ 2·8 = 57.
Note that the name crossprod() is a misnomer, as the function does not compute the vector cross product.
The function solve() will solve systems of linear equations and even find matrix inverses. For example, let's
solve this system:
x1 + x2 = 2
−x1 + x2 = 4
Its matrix form is
$\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}$
> a<-matrix(c(1,1,-1,1),ncol=2,byrow=T)
> b<-c(2,4)
> solve(a,b)
[1] -1 3
> solve(a)
[,1] [,2]
[1,] 0.5 -0.5
[2,] 0.5 0.5
In the second call to solve(), we do not give a second argument, so it computes the inverse of the matrix.
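A quick way to check both calls is to multiply back, as in this sketch: the solution should reproduce b, and the matrix times its inverse should give the identity.

```r
a <- matrix(c(1, 1, -1, 1), ncol = 2, byrow = TRUE)
b <- c(2, 4)

x <- solve(a, b)           # solution of a %*% x = b
a_inv <- solve(a)          # inverse of a

residual <- a %*% x - b    # should be numerically zero
identity_check <- a %*% a_inv

print(x)
print(identity_check)
```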
A few other linear algebra functions are:
t(): Matrix transpose
qr(): QR decomposition
chol(): Cholesky decomposition
det(): Determinant
eigen(): Eigen values/eigen vectors
diag(): Extracts the diagonal of a square matrix (useful for obtaining variances from a covariance
matrix and for constructing a diagonal matrix).
sweep(): Numerical analysis sweep operations
Note the versatile nature of diag(): If its argument is a matrix, it returns a vector, and vice versa. Also, if the
argument is a scalar, the function returns the identity matrix of the specified size.
> x<-matrix(1:9,ncol=3)
> diag(x)
[1] 1 5 9
> a<-c(1,2,3)
> diag(a)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
The sweep() function is capable of fairly complex operations. As a simple example, let's take a 3-by-3
matrix and add 3 to row 1, 4 to row 2, and 5 to row 3, and then add the same values by column instead.
> a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> sweep(a,1,c(3,4,5),"+")
[,1] [,2] [,3]
[1,] 4 7 10
[2,] 6 9 12
[3,] 8 11 14
> sweep(a,2,c(3,4,5),"+")
[,1] [,2] [,3]
[1,] 4 8 12
[2,] 5 9 13
[3,] 6 10 14
The first two arguments to sweep() are like those of apply(): the array and the margin, which is 1 for rows
in this case. The fourth argument is a function to be applied, and the third is an argument to that function.
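A common use of sweep() is centering each column of a matrix on its mean. The sketch below uses the same 3-by-3 matrix; the centering use case is an illustration and not part of the original example.

```r
a <- matrix(1:9, ncol = 3)

# Row-wise sweep from the example above: add 3, 4, 5 to rows 1, 2, 3
by_row <- sweep(a, 1, c(3, 4, 5), "+")

# Column-wise sweep: subtract each column's mean from that column
col_means <- colMeans(a)
centered <- sweep(a, 2, col_means, "-")

print(by_row)
print(centered)
```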
Vector Cross Product:- Let's consider the issue of vector cross products. The definition is very simple:
The cross product of vectors (x1, x2, x3) and (y1, y2, y3) in three-dimensional space is a new
three-dimensional vector, (x2y3 − x3y2, −x1y3 + x3y1, x1y2 − x2y1).
This can be expressed compactly as the expansion along the top row of the determinant below, where the
elements in the top row are merely placeholders:
$\begin{vmatrix} - & - & - \\ x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \end{vmatrix}$
The point is that the cross product vector can be computed as a sum of subdeterminants. For
instance, the first component of the cross product, x2y3 − x3y2, is easily seen to be the determinant of
the submatrix obtained by deleting the first row and first column:
$\begin{vmatrix} x_2 & x_3 \\ y_2 & y_3 \end{vmatrix}$
Function to calculate cross product of vectors:
xprod <- function(x,y)
{
m <- rbind(rep(NA,3),x,y)
xp <- vector(length=3)
for (i in 1:3)
xp[i] <- -(-1)^i * det(m[2:3,-i])
return(xp)
}
> xprod(c(12,4,2),c(2,1,1))
[1] 2 -8 4
3) setdiff(x,y): The set difference of any two sets A and B is the set of
elements that belong to A but not to B. It is denoted by A−B and read as
'A difference B'. A−B is also denoted by A\B. It is also called
the relative complement of B in A.
Eg: A = {1,2,3,4,5,6} B = {3,5,7,9}
A-B = {1,2,4,6}
B-A = {7,9}
4) setequal(x,y): Test for equality between x and y. If both x and y are equal it returns TRUE
otherwise returns FALSE
5) c %in% y: Membership, testing whether c is an element of the set y. If c is a vector, each element
of c is tested in turn: the result is TRUE where that element occurs anywhere in y and FALSE otherwise.
6) choose(n,r): Number of possible subsets of size r chosen from a set of size n
Eg:- > choose(2,1)
[1] 2
choose() function computes the combination nCr.
n: n elements
r: r subset elements
...
nCr = n!/(r! * (n-r)!)
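The set functions listed above can be tried on the sets A and B from the setdiff() example. The choose(5, 2) call is an extra illustration of the nCr formula and is not taken from the notes.

```r
A <- c(1, 2, 3, 4, 5, 6)
B <- c(3, 5, 7, 9)

a_minus_b <- setdiff(A, B)                    # elements of A not in B
b_minus_a <- setdiff(B, A)                    # elements of B not in A
same_set  <- setequal(c(1, 2, 3), c(3, 2, 1)) # order does not matter
membership <- c(2, 7) %in% A                  # one logical value per element

# choose() agrees with the factorial formula n!/(r! * (n-r)!)
n_subsets  <- choose(5, 2)
by_formula <- factorial(5) / (factorial(2) * factorial(3))

print(a_minus_b)
print(b_minus_a)
print(membership)
```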
? Code the symmetric difference between two sets, that is, all the elements belonging to exactly one of
the two operand sets. The symmetric difference between sets x and y consists exactly of those
elements in x but not y and vice versa, so it is the union of setdiff(x,y) and setdiff(y,x).
symdiff <- function(x,y)
{
sdfxy <- setdiff(x,y)
sdfyx <- setdiff(y,x)
return(union(sdfxy,sdfyx))
}
> x
[1] 1 2 5
> y
[1] 5 1 8 9
> symdiff(x,y)
[1] 2 8 9
? Write a binary operator for determining whether one set u is a subset of another set v.
Hint: A bit of thought shows that this property is equivalent to the intersection of u and v being equal
to u.
"%subsetof%" <- function(u,v)
{
return(setequal(intersect(u,v),u))
}
> c(2,8) %subsetof% 1:10
[1] TRUE
> c(12,8) %subsetof% 1:10
[1] FALSE
combn() :- The function combn() generates combinations. Let's find the subsets of {1,2,3} of size 2.
> x <- combn(1:3,2)
> x
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
Each column of the result is one subset.
Input /output:- I/O plays a central role in most real-world applications of computers. Just consider an
ATM cash machine, which uses multiple I/O operations for both input—reading your card and reading
your typed-in cash request—and output—printing instructions on the screen, printing your receipt, and
most important, controlling the machine to output your money!
R is not the tool you would choose for running an ATM, but it features a highly versatile
array of I/O capabilities.
1. Accessing the keyboard and monitor:- R provides several functions for accessing the keyboard and
monitor. A few of them are the scan(), readline(), print(), and cat() functions.
Using the scan( ) Function:-You can use scan() to read in a vector or a list, from a file or the keyboard.
Suppose we have files named z1.txt, z2.txt.
z1.txt contains the following
123
4 5
6
z2.txt contains the following
abc
de f
g
> scan("z1.txt")
Read 4 items
[1] 123 4 5 6
> scan("z2.txt")
Error in scan("z2.txt") : scan() expected 'a real', got 'abc'
> scan("z2.txt",what="")
Read 4 items
[1] "abc" "de" "f" "g"
The scan() function has an optional argument named what, which specifies mode, defaulting to
double mode. So, the non-numeric contents of the file z2 produced an error. But we then tried again,
with what="". This assigns a character string to what, indicating that we want character mode.
By default, scan() assumes that the items of the vector are separated by whitespace, which includes
blanks, carriage return/line feeds, and horizontal tabs. You can use the optional sep argument for
other situations.
You can use scan() to read from the keyboard by specifying an empty string for the filename:
> scan("")
1: 43 23 65 12
5:
Read 4 items
[1] 43 23 65 12
> scan("",what="")
1: "x" "y" "z" "srikanth" "Preethi" "omer"
7:
Read 6 items
[1] "x" "y" "z" "srikanth" "Preethi" "omer"
readline() function:- If you want to read in a single line from the keyboard, readline() is very handy.
readline() can be called with an optional prompt string.
> readline()
Hai how are u
[1] "Hai how are u"
> readline("Enter your Name: ")
Enter your Name: VIT
[1] "VIT"
Printing to the Screen:- At the top level of interactive mode, you can print the value of a variable or
expression by simply typing the variable name or expression. This won't work if you need to print
from within the body of a function. In that case, you can use the print() function,
> x <- 1:3
> print(x^2)
[1] 1 4 9
print() is a generic function, so the actual function called will depend on the class of the object that is
printed. If, for example, the argument is of class "table", then the print.table() function will be called.
It's a little better to use cat() instead of print(), as the latter can print only one expression and its
output is numbered, which may be a nuisance. Compare the results of the functions:
> print("abc")
[1] "abc"
> cat("abc\ndef")
abc
def
Note that we needed to supply our own end-of-line character, "\n", in the call to cat(). Without it, our
next call would continue to write to the same line. The arguments to cat() will be printed out with
intervening spaces:
> x
[1] 1 2 3
> cat(x,"abc","de\n")
1 2 3 abc de
If you don't want the spaces, set sep to the empty string "", as follows:
> cat(x,"abc","de\n",sep="")
123abcde
Any string can be used for sep. Here, we use the newline character:
> cat(x,"abc","de\n",sep="\n")
1
2
3
abc
de
sep can also be set to a vector of strings, like this:
> x <- c(5,12,13,8,88)
> cat(x,sep=c(".",".",".","\n","\n"))
5.12.13.8
88
2. Reading and Writing Files:- It includes reading data frames or matrices from files, working with text
files, accessing files on remote machines, and getting file and directory information.
Reading a Data Frame or Matrix from a File:- read.table() is used to read a data frame from the file.
> read.table("z1.txt",header=TRUE)
name nature
1 Hemant Obidient
2 Sowjanya Hardworking
3 Girija Friendly
4 Preethi Calm
scan() with its default settings would not work here, because the file contains character data and
scan() expects numeric values unless told otherwise. A purely numeric matrix, however, can be read with scan():
Mat<-matrix(scan("abc.txt"), nrow=2, ncol=2, byrow=T)
More generally, we can read a matrix by using read.table():
read.matrix<-function(filename){
as.matrix(read.table(filename))
}
Reading a Text-File: readLines() is used to read in a text file, either one line at a time or in a single
operation. For example, suppose we have a file z1 with the following contents:
John 25
Mary 28
Jim 19
We can read the file all at once, like this:
> z1 <- readLines("z1")
> z1
[1] "John 25" "Mary 28" "Jim 19"
Since each line is treated as a string, the return value here is a vector of strings—that is, a vector of
character mode.
There is one vector element for each line read, thus three elements here. Alternatively,
we can read it in one line at a time. For this, we first need to create a connection, as described next.
Introduction to Connections: Connection is R's term for a fundamental mechanism used in various
kinds of I/O operations. The connection is created by calling file(), url(), or one of several other R
functions. Type ?connection for details.
> c <- file("z1","r")
> readLines(c,n=1)
[1] "John 25"
> readLines(c,n=1)
[1] "Mary 28"
> readLines(c,n=1)
[1] "Jim 19"
> readLines(c,n=1)
character(0)
We opened the connection, assigned the result to c, and then read the file one line at a time, as
specified by the argument n=1. When R encountered the end of file (EOF), it returned an empty
result. We needed to set up a connection so that R could keep track of our position in the file as we
read through it. The same idea can be written as a loop that stops at EOF:
c <- file("z1","r")
while(TRUE)
{
rl <- readLines(c,n=1)
if (length(rl) == 0)
{
print("reached the end")
break
} else print(rl)
}
close(c)
OUTPUT:
[1] "John 25"
[1] "Mary 28"
[1] "Jim 19"
[1] "reached the end"
Accessing files on remote machines via urls: Certain I/O functions, such as read.table() and scan(),
accept web URLs as arguments.
> uci <- "http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data"
> ecc <- read.csv(uci)
Writing to a file: The function write.table() works very much like read.table(), except that it writes a
data frame instead of reading one.
> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d
kids ages
1 Jack 12
2 Jill 10
> write.table(d,"kds.txt")
In the case of writing a matrix to a file, just state that you do not want row or column names, as
follows:
write.table(xc, "xcnew", row.names=FALSE, col.names=FALSE)
The function cat() can also be used to write to a file, one part at a time.
> cat("abc\n",file="u")
> cat("de\n",file="u",append=TRUE)
The first call to cat() creates the file u, consisting of one line with contents "abc". The second call
appends a second line. The file is automatically saved after each operation.
writeLines() function can also be used, the counterpart of readLines(). If you use a connection, you
must specify "w" to indicate you are writing to the file, not reading from it:
> c <- file("www","w")
> writeLines(c("abc","de","f"),c)
> close(c)
The file www will be created with these contents:
abc
de
f
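The writing and reading functions can be checked against each other with a round trip. This sketch writes to a temporary file created by tempfile() rather than to the file names used above, so that nothing in the working directory is touched; that choice is an assumption made for the example.

```r
# Temporary file; the path is generated by R and is arbitrary
path <- tempfile(fileext = ".txt")

# Write three lines through a connection opened for writing
con <- file(path, "w")
writeLines(c("abc", "de", "f"), con)
close(con)

# Append a fourth line with cat()
cat("g\n", file = path, append = TRUE)

# Read everything back in one operation
lines_back <- readLines(path)
print(lines_back)

unlink(path)   # remove the temporary file
```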
STATISTICS WITH R PROGRAMMING
Problem: Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of
successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:
$b(2; 5, 0.167) = \binom{5}{2} (0.167)^2 (0.833)^3$
$b(2; 5, 0.167) = 0.161$
R Code:
> dbinom(2, size=5, prob=0.167)
[1] 0.1612
STATISTICS WITH R PROGRAMMING Unit - V
Problem: In a restaurant seventy percent of people order for Chinese food and thirty percent for Italian food. A
group of three persons enter the restaurant. Find the probability of at least two of them ordering for Italian food.
Solution:-
The probability of ordering Chinese food is 0.7 and the probability of ordering Italian food is 0.3. Now, if
at least two of them are ordering Italian food then it implies that either two or three will order Italian
food.
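The required probability is therefore P(X = 2) + P(X = 3) for a binomial variable with 3 trials and success probability 0.3. A minimal sketch of the calculation in R follows; the variable names are illustrative.

```r
p_italian <- 0.3   # probability that one person orders Italian food
n_people  <- 3

# P(X = 2) + P(X = 3)
p_at_least_two <- sum(dbinom(2:3, size = n_people, prob = p_italian))

# The same value from the upper tail, P(X > 1)
p_upper <- pbinom(1, size = n_people, prob = p_italian, lower.tail = FALSE)

print(p_at_least_two)
```

By hand this is 3(0.3)^2(0.7) + (0.3)^3 = 0.189 + 0.027 = 0.216.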
Cumulative Binomial Probability:- A cumulative binomial probability refers to the probability that the
binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower
limit and less than or equal to a stated upper limit).
Problem: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five
possible answers, and only one of them is correct. Find the probability of having four or less correct answers if a
student attempts to answer every question at random.
Solution:
Since only one out of five possible answers is correct, the probability of
answering a question correctly by random is 1/5=0.2.
To find the probability of having exactly 4 correct answers by
random attempts as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1329
To find the probability of having four or less correct answers by random attempts, we
apply the function dbinom with x = 0,…,4.
> dbinom(0, size=12, prob=0.2) + dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) + dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.9274
Alternatively, we can use the cumulative probability function for binomial
distribution pbinom.
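A sketch of that alternative: pbinom() with q = 4 returns P(X ≤ 4) directly and should agree with the sum of the five dbinom() terms.

```r
# Sum of the individual probabilities for x = 0, ..., 4
p_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))

# Cumulative probability P(X <= 4) in one call
p_cdf <- pbinom(4, size = 12, prob = 0.2)

print(round(p_cdf, 4))
```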
Problem: Fit an appropriate binomial distribution and calculate the theoretical distribution
x: 0 1 2 3 4 5
f: 2 14 20 34 22 8
Solution:
Here n = 5 , N = 100
Mean = ∑ xi fi / ∑ fi = 284/100 = 2.84
np = 2.84
p = 2.84/5 = 0.568
q = 0.432
> plot(fitbin)
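The notes do not show how the object fitbin was built, so it is left as is. As a hedged sketch, the theoretical (expected) frequencies can be computed directly from the estimate p = 0.568 with dbinom(); the variable names below are illustrative.

```r
x <- 0:5                         # observed values
f <- c(2, 14, 20, 34, 22, 8)     # observed frequencies
n <- 5                           # number of trials per observation
N <- sum(f)                      # total frequency

# Estimate p from the sample mean: mean = n * p
p_hat <- sum(x * f) / (N * n)

# Theoretical frequencies N * P(X = x)
expected <- N * dbinom(x, size = n, prob = p_hat)

print(p_hat)
print(round(expected, 2))
```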
Examples:
1. The number of defective electric bulbs manufactured by a reputed company.
2. The number of telephone calls per minute at a switch board
3. The number of cars passing a certain point in one minute.
4. The number of printing mistakes per page in a large text.
R has four built-in functions for the Poisson distribution. They are described below.
dpois(x, lambda, log = FALSE) :- This function gives the probability mass function, that is, P(X = x)
at each point x.
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE) :- This function gives the cumulative
probability P(X ≤ q), or P(X > q) when lower.tail = FALSE.
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE):- This function takes a probability value
and gives the smallest number whose cumulative probability is at least that value.
rpois(n, lambda) :- This function generates n random values from the Poisson distribution with
mean lambda.
Following is the description of the parameters used:
x and q are vectors of numbers (counts).
p is a vector of probabilities.
n is number of observations.
lambda is the mean number of events per interval.
Problem:- If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution:-
The probability of having sixteen or less cars crossing the bridge in a
particular minute is given by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the
bridge in a minute is in the upper tail of the probability density function.
> ppois(16, lambda=12, lower=FALSE) # upper tail
[1] 0.10129
Answer:- If there are twelve cars crossing a bridge per minute on average, the probability of
having seventeen or more cars crossing the bridge in a particular minute is 10.1%.
Problem:- The average number of homes sold by the Acme Realty company is 2 homes per day. What is the
probability that exactly 3 homes will be sold tomorrow?
Solution: This is a Poisson experiment in which we know the following:
μ = 2; since 2 homes are sold per day, on average.
x = 3; since we want to find the likelihood that 3 homes will be
sold tomorrow.
e = 2.71828; since e is a constant equal to approximately
2.71828.
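We plug these values into the Poisson formula $P(x; \mu) = \frac{e^{-\mu} \mu^x}{x!}$ as follows:
$P(3; 2) = \frac{e^{-2} \cdot 2^3}{3!} = \frac{(0.13534)(8)}{6} = 0.180$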
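The Acme Realty calculation can be checked in R. This sketch evaluates the Poisson formula by hand and compares it with dpois().

```r
# Poisson formula with mu = 2 and x = 3
p_formula <- exp(-2) * 2^3 / factorial(3)

# The same probability from the built-in pmf
p_dpois <- dpois(3, lambda = 2)

print(round(p_dpois, 4))
```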
Cumulative Poisson Probability:- A cumulative Poisson probability refers to the probability that the
Poisson random variable is greater than some specified lower limit and less than some specified upper
limit.
Problem:-Suppose the average number of lions seen on a 1-day safari is 5.
What is the probability that tourists will see fewer than four lions on the next
1-day safari?
Solution: This is a Poisson experiment in which we know the following:
μ = 5; since 5 lions are seen per safari, on average.
x = 0, 1, 2, or 3; since we want to find the likelihood that
tourists will see fewer than 4 lions; that is, we want the
probability that they will see 0, 1, 2, or 3 lions.
e = 2.71828; since e is a constant equal to approximately
2.71828.
To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus,
we need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute
this sum, we use the Poisson formula:
$P(x \le 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)$
$P(x \le 3; 5) = \frac{e^{-5} 5^0}{0!} + \frac{e^{-5} 5^1}{1!} + \frac{e^{-5} 5^2}{2!} + \frac{e^{-5} 5^3}{3!}$
$P(x \le 3; 5) = \frac{(0.006738)(1)}{1} + \frac{(0.006738)(5)}{1} + \frac{(0.006738)(25)}{2} + \frac{(0.006738)(125)}{6}$
$P(x \le 3; 5) = 0.006738 + 0.03369 + 0.084224 + 0.140375$
$P(x \le 3; 5) = 0.2650$
Thus, the probability of seeing no more than 3 lions (that is, fewer than four) is 0.2650.
R Code:-
> ppois(3,lambda = 5)
[1] 0.2650259
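A normal random variable X with mean μ and standard deviation σ has the probability density
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, on the domain $-\infty < x < \infty$.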
Standard Normal Distribution
It is the distribution that occurs when a normal random variable has a mean of zero and a standard
deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score.
Every normal random variable X can be transformed into a z score via the following equation:
Z = (X - μ) / σ
where X is a normal random variable, μ is the mean, and σ is the standard deviation.
yielding the standard normal density $\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$.
Standard Normal Curve:- One way of figuring out how data are
distributed is to plot them in a graph. If the data are normally distributed,
you will see a bell curve. A bell curve has a small percentage
of the points on both tails and the bigger percentage on the inner part of
the curve. The standard normal distribution has this bell shape, centred at zero.
[Figure: the standard normal curve]
R functions:
dnorm(x, mean = 0, sd = 1, log = FALSE) :- This function gives the height of the probability density
at each point x.
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE):- This function gives the cumulative
probability P(X ≤ q), or P(X > q) when lower.tail = FALSE.
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE):- This function takes a probability
value and gives the number whose cumulative probability matches that value.
rnorm(n, mean = 0, sd = 1) :- This function generates n random values from the normal distribution
with the given mean and standard deviation.
Problem:-X is a normally distributed variable with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
Solution: Here A(z) denotes the area under the standard normal curve between 0 and z.
a) For x = 40,
z = (x − µ) / σ
⇒ z = (40 − 30) / 4
= 2.5 (= z1 say)
Hence P(x < 40) = P(z < 2.5)
= 0.5 + A(z1) = 0.9938
b) For x = 21,
z = (x − µ) / σ
⇒ z = (21 − 30) / 4
= −2.25 (= −z1 say)
Hence P(x > 21) = P(z > −2.25)
= 0.5 + A(z1) = 0.9878
c) For x = 30,
z = (x − µ) / σ
⇒ z = (30 − 30) / 4 = 0, and
for x = 35,
z = (x − µ) / σ
⇒ z = (35 − 30) / 4
= 1.25
Hence P(30 < x < 35) = P(0 < z < 1.25)
= [area to the left of z = 1.25] − [area to the left of 0]
= 0.8944 − 0.5 = 0.3944
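The three table look-ups can be cross-checked with pnorm(), which accepts the mean and standard deviation directly so that no conversion to z is needed. This is a sketch; the variable names are illustrative.

```r
mu <- 30
sigma <- 4

p_a <- pnorm(40, mean = mu, sd = sigma)                      # P(x < 40)
p_b <- pnorm(21, mean = mu, sd = sigma, lower.tail = FALSE)  # P(x > 21)
p_c <- pnorm(35, mean = mu, sd = sigma) -
       pnorm(30, mean = mu, sd = sigma)                      # P(30 < x < 35)

print(round(c(p_a, p_b, p_c), 4))
```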
Problem:-The length of life of an instrument produced by a machine has a normal distribution with a mean of 12
months and standard deviation of 2 months. Find the probability that an instrument
produced by this machine will last
a) less than 7 months.
b) between 7 and 12 months.
Solution:
a) P(x < 7)
For x = 7,
z = (x − µ) / σ
⇒ z = (7 − 12) / 2
= −2.5 (= −z1 say)
Hence P(x < 7) = P(z < −2.5)
= 0.0062
b) P(7 < x < 12)
For x = 12,
z = (x − µ) / σ
⇒ z = (12 − 12) / 2
= 0
Hence P(7 < x < 12) = P(−2.5 < z < 0)
= 0.5 − 0.0062 = 0.4938
Problem:-The Tahoe Natural Coffee Shop morning customer load follows a normal
distribution with mean 45 and standard deviation 8. Determine the probability that the
number of customers tomorrow will be less than 42.
Solution:-
We first convert the raw score to a z-score. We have
z = (x − µ) / σ
⇒ z = (42 − 45) / 8 = −0.375
Next, we use the table to find the probability. The table gives 0.3520. (We have rounded the z-score
to −0.38.)
We can conclude that
P(x < 42) = P(z < −0.38)
= 0.352
That is, there is about a 35% chance that there will be fewer than 42 customers tomorrow.
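In R the rounding step is unnecessary, because pnorm() works with the unrounded value. The sketch below compares the exact probability with the one obtained from the rounded z-score used for the table.

```r
# Exact probability from the raw score
p_exact <- pnorm(42, mean = 45, sd = 8)

# Probability from the z-score rounded to two decimals, as in the table look-up
z_table <- -0.38
p_table <- pnorm(z_table)

print(round(c(p_exact, p_table), 4))
```

The two values differ only in the third decimal place, so the conclusion of about a 35% chance is unchanged.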
Example:
> x <- c(92,117,109,85,117,107,82,83,119,113,101,106,101,84,126,69,82,79,84,100,104,111,109,92,93,107,
81,118,81,133,111,82,120,103,115,89,74,110,83,110,96,102,108,110,140,106,111,98,98,99,74,101,107,104,
128,87,95,109,104,91,83,98,99,103,126,123,85,98,93,100)
Problem:-Assume that the test scores of a college entrance exam fits a normal distribution. Furthermore,
the mean test score is 72, and the standard deviation is 15.2. What is the percentage of students scoring
84 or more in the exam?
Solution:-
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are interested in
the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Problem:- The local ice cream shop keeps track of how much ice cream they sell versus the temperature
on that day, here are their figures for the last 12 days:
Temperature (°C)   14.2 16.4 11.9 15.2 18.5 22.1 19.4 25.1 23.4 18.1 22.6 17.2
Ice cream sales    $215 $325 $185 $332 $406 $522 $412 $614 $544 $421 $445 $408
Solution:-
R Code:-
> temp <- c(14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2)
> sales <- c(215,325,185,332,406,522,412,614,544,421,445,408)
> corr_coeff <- cor(temp,sales)
> corr_coeff
[1] 0.9575066
> cov(temp,sales)
[1] 484.0932
# Draw the scatter plot, then add a line of best fit
> plot(temp, sales, pch=16,col="red")
> abline(lm(sales~temp),col="blue")
T-test for single mean:- One-sample t-test is used to compare the mean of a sample to a
specified theoretical mean (μ0).
Let X represent a set of n values with sample mean x̄ and sample standard deviation s.
The comparison of the observed mean x̄ to the theoretical value μ0 is performed with
the formula below:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
To evaluate whether the difference is statistically significant, you first have to read from the t-test
table the critical value of Student's t distribution corresponding to the significance level alpha of your
choice (for example, 5%). The degrees of freedom (df) used in this test are: df = n−1
Problem:-: A professor wants to know if her introductory statistics class has a good grasp of basic
math. Six students are chosen at random from the class and given a math proficiency test. The professor
wants the class to be able to score above 70 on the test. The six students get scores of 62, 92, 75, 68, 83,
and 95. Can the professor have 90 percent confidence that the mean score for the class on the test would
be above 70?
Solution:-
Null hypothesis: H0: μ = 70
Alternative hypothesis: Ha: μ > 70
Level of significance: 0.10 (90 percent confidence), one-tailed
First, compute the sample mean and standard deviation:
$\bar{x} = \frac{62 + 92 + 75 + 68 + 83 + 95}{6} = \frac{475}{6} = 79.17$
$s = 13.17$
The test statistic is
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{79.17 - 70}{13.17 / \sqrt{6}} = \frac{9.17}{5.38} = 1.71$ (calculated value of t)
To test the hypothesis, the computed t-value of 1.71 is compared with the critical value in the t-table.
For a one-tailed test at the 0.10 level with 5 df the critical value is 1.476. The calculated value of t is
more than the table value of t, so the null hypothesis is rejected.
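R code:- (the call below runs a two-sided test, so its p-value is twice the one-tailed p-value)
> x <- c(62,92,75,68,83,95)
> t.test(x,alternative="two.sided",mu=70)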
One Sample t-test
data: x
t = 1.7053, df = 5, p-value = 0.1489
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
65.34888 92.98446
sample estimates:
mean of x
79.16667
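Because the professor's question is one-sided (is the mean above 70?), a one-sided call matches the stated hypotheses more closely. The following sketch uses alternative = "greater" and also recomputes t by hand; the 0.10 level corresponds to the 90 percent confidence asked for in the problem.

```r
scores <- c(62, 92, 75, 68, 83, 95)

# One-sided test of H0: mu = 70 against Ha: mu > 70
res <- t.test(scores, alternative = "greater", mu = 70)

# Test statistic computed from the formula (x-bar - mu0) / (s / sqrt(n))
t_manual <- (mean(scores) - 70) / (sd(scores) / sqrt(length(scores)))

# One-tailed critical value at the 0.10 level with n - 1 = 5 df
t_crit <- qt(0.90, df = length(scores) - 1)

print(res$p.value)
print(c(t_manual, t_crit))
```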
Problem:- A sample of 26 bulbs gives a mean life of 990 hours with S.D of 20 hours. The manufacturer
claims that the mean life of bulbs is 1000 hours. Does the sample meet the standard?
Solution: Here n = 26,
Sample mean x̅ = 990 hours
S.D s = 20 hours
Population mean µ = 1000 hours
Df = n-1 = 26-1 = 25
Null Hypothesis H0: The sample meets the standard, i.e. µ = 1000 hours
Alternative Hypothesis HA: µ not equal to 1000
Level of Significance: 0.05
The test statistic is
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
t = (990 − 1000) / (20/√26)
= −2.55, so |t| = 2.55 (calculated value of t)
The two-tailed table value of t with 25 df at the 5% level is 2.060.
The calculated |t| is more than the table value of t, so the null hypothesis is rejected at the 5% level.
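Since only summary figures are given, t.test() cannot be used directly. As a sketch, the statistic, the two-tailed critical value and the p-value can be computed from the formula with qt() and pt(), taking the standard error as s/√n.

```r
n <- 26       # sample size
xbar <- 990   # sample mean (hours)
s <- 20       # sample standard deviation (hours)
mu0 <- 1000   # claimed mean life (hours)

t_stat <- (xbar - mu0) / (s / sqrt(n))        # test statistic
t_crit <- qt(0.975, df = n - 1)               # two-tailed 5% critical value
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
reject <- abs(t_stat) > t_crit

print(c(t_stat, t_crit, p_value))
print(reject)
```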
Paired comparisons( Paired t-test ):- Sometimes data comes from non independent samples. An
example might be testing "before and after" of cosmetics or consumer products. We could use a single
random sample and do "before and after" tests on each person. A hypothesis test based on these data
would be called a paired comparisons test. Since the observations come in pairs, we can study the
difference, d, between the samples. The difference between each pair of measurements is called di.
Test statistic:- With a simple random sample of n pairs of measurements from a
normally distributed population, the mean of the differences, d̄, is tested using the following
form of t, where S is the standard deviation of the differences:
$t = \frac{\bar{d}}{S / \sqrt{n}}$
Problem :- The blood pressure of 5 women before and after intake of a certain drug are
given below: Test whether there is significant change in blood pressure at 1% level of
significance.
Before 110 120 125 132 125
After 120 118 125 136 121
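Solution: Take d_i = (after − before) for each woman: 10, −2, 0, 4, −4, so that d̄ = 8/5 = 1.6.
$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2$
$= \frac{1}{4} \sum_{i=1}^{5} (d_i - \bar{d})^2$
$= \frac{1}{4} [(10 - 1.6)^2 + (-2 - 1.6)^2 + (0 - 1.6)^2 + (4 - 1.6)^2 + (-4 - 1.6)^2]$
$= \frac{123.20}{4} = 30.8$
$S = \sqrt{30.8} = 5.55$
Test statistic: The test statistic is t which is calculated as
$t = \frac{\bar{d}}{S / \sqrt{n}} = \frac{1.6}{5.55 / \sqrt{5}} = 0.645$
Calculated |t| value is 0.645
Tabulated t0.01 with 5−1 = 4 degrees of freedom is 3.747.
Since calculated t < t0.01, we accept the null hypothesis and conclude that there is no significant
change in blood pressure.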
R code:-
> x <- c(110,120,125,132,125)
> y <- c(120,118,125,136,121)
> t.test(x,y,paired=TRUE)
Paired t-test
data: x and y
t = -0.64466, df = 4, p-value = 0.5543
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.490956 5.290956
sample estimates:
mean of the differences
-1.6
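State the level of significance and find the critical value. The critical value, from
Student's t-distribution, has n1 + n2 − 2 degrees of freedom for the pooled test used here.
Compute the test statistic.
Compare the test statistic to the critical value and state a conclusion.
$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$
where
$S^2 = \frac{\sum_i (x_i - \bar{x})^2 + \sum_i (y_i - \bar{y})^2}{n_1 + n_2 - 2}$ or $S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}$
(here $s_1^2$ and $s_2^2$ are the sample variances computed with divisors $n_1$ and $n_2$).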
Problem:- Two horses A and B were tested according to the time (in seconds) to run a particular track
with the following results.
Horse A 28 30 32 33 33 29 34
Horse B 29 30 30 24 27 29
Test whether the two horses have the same running capacity.
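Solution: H0: the two horses have the same mean running time. The sample means are x̄ = 31.286 (horse A,
n1 = 7) and ȳ = 28.16 (horse B, n2 = 6).
$S^2 = \frac{\sum_i (x_i - \bar{x})^2 + \sum_i (y_i - \bar{y})^2}{n_1 + n_2 - 2} = \frac{31.4358 + 26.8336}{7 + 6 - 2} = 5.30$
Therefore $S = \sqrt{5.30} = 2.3$
Computation: $t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{31.286 - 28.16}{(2.3) \sqrt{\frac{1}{7} + \frac{1}{6}}} = 2.443$
Tabulated t0.05 with 7+6−2 = 11 degrees of freedom at the 5% level of significance is 2.2.
Since calculated t > t0.05, we reject the null hypothesis and conclude that the two horses do not have the
same running capacity.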
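The horse data can be run through t.test() with var.equal = TRUE, which requests the pooled-variance test with n1 + n2 − 2 degrees of freedom. This is a sketch; R carries full precision, so its statistic (about 2.44) differs slightly from the hand calculation, which uses rounded intermediate values.

```r
horse_a <- c(28, 30, 32, 33, 33, 29, 34)
horse_b <- c(29, 30, 30, 24, 27, 29)

# Pooled two-sample t-test (equal variances assumed)
res <- t.test(horse_a, horse_b, var.equal = TRUE)

# Two-tailed 5% critical value with 11 degrees of freedom
t_crit <- qt(0.975, df = 11)

print(res$statistic)
print(res$p.value)
print(t_crit)
```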
Null hypothesis H0: There are no differences among the mean values of the groups being compared
(i.e., the group means are all equal):
H0: µ1 = µ2 = µ3 = … = µk
Alternative hypothesis H1 (the conclusion if H0 is rejected):
Not all group means are equal (i.e., at least one group mean is different from the rest).
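$TSS = S_T^2 = \sum_i \sum_j X_{ij}^2 - cf$
Step 4: Treatment sum of squares
$TrSS = S_{Tr}^2 = \sum_j \frac{T_j^2}{N_j} - cf$
Step 5: Error sum of squares
$ESS = S_E^2 = TSS - TrSS$
Here cf is the correction factor, T_j is the total of the j-th treatment and N_j is the number of observations in it.
Source of variation           d.f.   Sum of squares                            Mean sum of squares    F-test
Treatment (between samples)   k−1    $S_{Tr}^2 = \sum_j T_j^2 / N_j - cf$      $S_{Tr}^2 / (k-1)$     $F_{cal} = \frac{S_{Tr}^2 / (k-1)}{S_E^2 / (n-k)}$
Error                         n−k    $S_E^2 = TSS - TrSS$                      $S_E^2 / (n-k)$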
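The steps above can be sketched in R and compared with the built-in aov() function. The data below are hypothetical (three treatments with four observations each) and are invented only to illustrate the arithmetic; the correction factor is taken as the squared grand total divided by the number of observations, which is an assumption since its definition is not shown above.

```r
# Hypothetical data: k = 3 treatments, 4 observations each
values <- c(10, 12, 11, 13,  14, 15, 13, 16,  9, 8, 10, 9)
group  <- factor(rep(c("A", "B", "C"), each = 4))

n <- length(values)   # total number of observations
k <- nlevels(group)   # number of treatments

cf   <- sum(values)^2 / n                 # correction factor (assumed definition)
tss  <- sum(values^2) - cf                # total sum of squares
totals <- tapply(values, group, sum)      # treatment totals T_j
sizes  <- tapply(values, group, length)   # treatment sizes N_j
trss <- sum(totals^2 / sizes) - cf        # treatment sum of squares
ess  <- tss - trss                        # error sum of squares

# F statistic from the ANOVA table
f_cal <- (trss / (k - 1)) / (ess / (n - k))

# The same F statistic from aov()
fit <- summary(aov(values ~ group))
f_aov <- fit[[1]][["F value"]][1]

print(c(f_cal, f_aov))
```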